1 Introduction
Linear mixed effects models are widely used for the analysis of clustered data in econometrics, biomedicine and other applied sciences. Such a model includes additional random-effect components, along with the usual fixed-effects regression terms, to account for variability among clusters. In biomedical applications, typical examples are longitudinal studies with repeated measurements within individuals and multicenter studies with patients clustered within centers. Due to recent technological advances, we often have access to sets of extremely high-dimensional explanatory variables, typically so-called omics data, in such studies. Hence, the potential fixed-effects variables are often in the order of millions even in studies with relatively few patients. Thus, we have to select the important fixed-effects variables from the vast pool of available variables under an ultra-high dimensional setup. Note that, in most such studies, the relevant random-effect variables are typically few and hence their selection is not necessary; all of them can be included in the model.
Mathematically, given $g$ groups (e.g., centers) indexed by $i=1,\ldots,g$, we observe $n_i$ responses in the $i$-th group, denoted by the $n_i$-dimensional vector $\boldsymbol{y}_i$. The associated fixed and random effect covariate values are, respectively, denoted by the $n_i\times p$ matrix $\boldsymbol{X}_i$ and the $n_i\times q$ matrix $\boldsymbol{Z}_i$; often $\boldsymbol{Z}_i$ is a subset of $\boldsymbol{X}_i$. Let $N=\sum_{i=1}^g n_i$ denote the total number of observations. Then, the linear mixed model (LMM) is defined by the relation (Pinheiro and Bates, 2000)

(1) $\boldsymbol{y}_i = \boldsymbol{X}_i\boldsymbol{\beta} + \boldsymbol{Z}_i\boldsymbol{b}_i + \boldsymbol{\varepsilon}_i, \quad i=1,\ldots,g,$
where $\boldsymbol{\beta}\in\mathbb{R}^p$ is the fixed effects (regression) coefficient vector, the $\boldsymbol{b}_i$s are the random effects and the $\boldsymbol{\varepsilon}_i$s are random error components in the model. We assume that $\boldsymbol{b}_i\sim N_q(\boldsymbol{0},\boldsymbol{\Psi}_{\boldsymbol{\theta}})$ and $\boldsymbol{\varepsilon}_i\sim N_{n_i}(\boldsymbol{0},\sigma^2\boldsymbol{I}_{n_i})$, for each $i$, and that they are independent of each other and also of the $\boldsymbol{X}_i$s; here $\boldsymbol{\theta}$ is a finite-dimensional variance parameter that completely specifies the matrix $\boldsymbol{\Psi}_{\boldsymbol{\theta}}$ and $\boldsymbol{I}_{n_i}$ denotes the identity matrix of order $n_i$. Then, given $\boldsymbol{X}_i$ (and $\boldsymbol{Z}_i$), $\boldsymbol{y}_i\sim N_{n_i}(\boldsymbol{X}_i\boldsymbol{\beta},\boldsymbol{V}_i)$, independently for each $i$, where $\boldsymbol{V}_i=\boldsymbol{Z}_i\boldsymbol{\Psi}_{\boldsymbol{\theta}}\boldsymbol{Z}_i^T+\sigma^2\boldsymbol{I}_{n_i}$. Stacking the variables in larger matrices, we can rewrite the LMM (1) as

(2) $\boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{Z}\boldsymbol{b} + \boldsymbol{\varepsilon},$
where $\boldsymbol{y}=(\boldsymbol{y}_1^T,\ldots,\boldsymbol{y}_g^T)^T$, $\boldsymbol{X}=(\boldsymbol{X}_1^T,\ldots,\boldsymbol{X}_g^T)^T$ and $\boldsymbol{Z}=\mathrm{diag}(\boldsymbol{Z}_1,\ldots,\boldsymbol{Z}_g)$ are the stacked matrices, and $\boldsymbol{b}=(\boldsymbol{b}_1^T,\ldots,\boldsymbol{b}_g^T)^T$ and $\boldsymbol{\varepsilon}=(\boldsymbol{\varepsilon}_1^T,\ldots,\boldsymbol{\varepsilon}_g^T)^T$. Now, given $\boldsymbol{X}$, we have $\boldsymbol{y}\sim N_N(\boldsymbol{X}\boldsymbol{\beta},\boldsymbol{V})$ with $\boldsymbol{V}=\mathrm{diag}(\boldsymbol{V}_1,\ldots,\boldsymbol{V}_g)$. Given this model, our objective is to perform inference about the unknown regression parameter vector $\boldsymbol{\beta}$ and the variance parameters $(\boldsymbol{\theta},\sigma^2)$. Following the recent applications described at the beginning, we assume that the number of fixed-effect variables ($p$) is much larger than the total sample size $N$, but the number of random effects ($q$) is small and prefixed. In particular, we assume the ultra-high dimensional setup with $\log p = O(N^{\alpha})$ for some $\alpha\in(0,1)$, which is often the case with omics data analysis. Then the total number of parameters exceeds the sample size and we need to impose some sparsity condition when estimating $\boldsymbol{\beta}$. This entails a selection of the important fixed-effect variables; we assume that the number of such variables is small compared to $N$. Under such a sparsity assumption, the selection of the important fixed-effect variables and the corresponding parameter estimation are done by maximizing a suitably penalized log-likelihood function (Schelldorfer, Buhlmann and Van de Geer, 2011; Ghosh and Thoresen, 2018; Fan and Li, 2012); the resulting estimator is known as the maximum penalized likelihood estimator (MPLE) and a brief description is provided in Section 2.
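As a concrete illustration, the model (1)–(2) can be simulated directly. The following minimal sketch uses small, hypothetical dimensions (10 groups of 5 observations each, 50 fixed-effect covariates, 2 random effects) and takes the random-effect covariates as the first two columns of the fixed-effect design; all numerical values here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

g, n_i, p, q = 10, 5, 50, 2           # assumed small dimensions for illustration
sigma2 = 1.0                          # error variance
Psi = np.array([[1.0, 0.3],           # hypothetical random-effects covariance
                [0.3, 1.0]])

beta = np.zeros(p)
beta[:3] = [1.0, 2.0, 1.5]            # sparse fixed effects: only 3 nonzero

ys, Xs = [], []
for i in range(g):
    X_i = rng.standard_normal((n_i, p))
    Z_i = X_i[:, :q]                  # random-effect covariates: subset of X_i
    b_i = rng.multivariate_normal(np.zeros(q), Psi)
    eps_i = rng.normal(0.0, np.sqrt(sigma2), n_i)
    y_i = X_i @ beta + Z_i @ b_i + eps_i     # model (1), group by group
    ys.append(y_i)
    Xs.append(X_i)

# stacked form (2): y = X beta + Z b + eps, with Z block-diagonal
y = np.concatenate(ys)
X = np.vstack(Xs)
N = y.size                            # total number of observations
```

Note that the total sample size here is tiny relative to realistic omics settings, where the fixed-effect dimension would be far larger than N.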
Under our ultra-high dimensional regime, an important desired property of the MPLE is the oracle variable selection consistency, which ensures that only and all the truly important variables are selected asymptotically with probability tending to one. All the existing literature studies this property of the MPLE under the assumption of full exogeneity of the model, i.e., independence between clusters and mutual independence between the covariates, the random effects and the error variable. However, these independence assumptions may not always hold in practice, as has already been studied in detail for the classical low-dimensional settings along with appropriate remedies like the instrumental variable (IV) method; see, among many others, Ebbes et al. (2004, 2015), Kim and Frees (2007), Wooldridge (2010, 2012), Bates et al. (2014). In our ultra-high dimensional setup, it is practically too demanding to always expect all exogeneity assumptions to hold; in particular, the assumption of independence between the error and all the covariates is quite vulnerable and also not verifiable for extremely large numbers of covariates. We will see, in Section 2, that the usual MPLE of the linear mixed model parameters is seriously affected under violation of different exogeneity assumptions; endogeneity also significantly increases the number of false positives in fixed-effects selection. To our knowledge, there is no literature studying the effects of such endogeneity and developing appropriate remedies under high-dimensional mixed models. This paper aims to tackle this important problem, with particular focus on level-1 endogeneity arising from error-covariate correlation, and to propose a new consistent selection procedure for the fixed-effect variables, along with estimation of all parameters, under such endogeneity.
The endogeneity issue under a high or ultra-high dimensional setup was first considered by Fan and Liao (2014) in the usual regression setup, where the authors proposed a focused generalized method-of-moments (FGMM) estimator to consistently select and estimate the nonzero regression coefficients. In this paper, we extend their FGMM approach to consistently select the important fixed-effect variables under our ultra-high dimensional linear mixed model setup, and then to estimate the variance parameters along with the (fixed-effects) regression coefficients of the selected variables in a second stage. The proposed method is shown to satisfy the oracle variable selection consistency for the fixed effects even under error-covariate endogeneity. The overall procedure is implemented by an efficient algorithm and verified with suitable numerical illustrations.
The main contributions of the paper can be summarized as follows.

We investigate the effect of endogeneity on the selection of fixed effects and parameter estimation in ultra-high dimensional linear mixed-effects models (LMMs). This is indeed the first such attempt for mixed models with an exponentially increasing number of fixed-effects covariates, and we prove the inconsistency of the usual penalized likelihood procedures in such cases under violation of the exogeneity assumption.

We propose a new procedure for selecting important fixed-effects variables in the presence of level-1 (error-covariate) endogeneity in such ultra-high dimensional LMMs. Our method is based on the profiled focused generalized method of moments (PFGMM). It handles the endogeneity issue through the use of appropriate instrumental variables, and it uses general nonconcave penalties, like SCAD, to carry out sparse variable selection. The problem of unknown variance components is solved by the use of an appropriate proxy matrix, as in Fan and Li (2012). Our proposal is seen to produce significantly fewer false positives, both in simulations and in a real data application, compared to the usual penalized likelihood method of Fan and Li (2012) in the presence of endogeneity in the data.

We rigorously prove the consistency of the parameter estimates of fixedeffects coefficients and their oracle variable selection property under appropriate verifiable conditions. Our assumptions on the penalty are completely general to cover most common nonconcave penalties like SCAD or MCP. The proof also allows the important selected variables to be endogenous, by allowing the instrumental variables to be completely external to the regression model.

We also prove, under appropriate conditions, an asymptotic normality result for the estimates of the fixedeffects coefficients obtained by our PFGMM. This will further help us to develop testing procedures in endogenous highdimensional LMM in the future.

An efficient computational algorithm is also discussed, along with the practical issues of selecting the proxy matrix and the regularization parameter. Along with extensive numerical illustrations, good-to-go suggestions for their choices are provided, which are expected to work for most practical data with a strong signal-to-noise ratio. The (unoptimized) MATLAB code is available from the authors upon request.

Once the important fixedeffects variables are selected consistently, we also discuss and illustrate a few second stage estimation procedures to estimate the variance parameters of the LMM along with refinements of the fixedeffects coefficients.

Although our primary focus is on error-covariate endogeneity, we finally also briefly illustrate the effects of other types of endogeneity (level-2) on our proposed PFGMM approach to variable selection. Interestingly, our proposal is seen to work consistently in most such scenarios as well, which we would like to investigate theoretically in subsequent work.
The rest of the paper is organized as follows. We start with a description of the usual maximum penalized likelihood approach and its inconsistency in the presence of endogeneity in Section 2. In Section 3, we discuss the proposed PFGMM approach with its motivation, oracle consistency of variable selection and parameter estimation, an asymptotic normality result, and computational aspects with numerical illustrations. Estimation of the variance parameters in the second-stage refinement is discussed and illustrated in Section 4. The effect of level-2 endogeneity is examined numerically in Section 5, and a real data application is presented in Section 6. Finally, the paper ends with some brief concluding remarks in Section 7.
2 The MPLE and Endogeneity: Types and Effects
Let us start with a brief description of the MPLE under the ultra-high dimensional linear mixed model and study the effects of different types of endogeneity on it. Consider the notation and setup of Section 1. Using the normality of the stacked response vector in our LMM (2), the corresponding log-likelihood function of the parameters turns out (up to an additive constant) to be
(3) $\ell(\boldsymbol{\beta},\boldsymbol{\theta},\sigma^2) = -\frac{1}{2}\log|\boldsymbol{V}| - \frac{1}{2}(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta})^T\boldsymbol{V}^{-1}(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}).$
Adding an appropriate penalty function, say $P_\lambda(\cdot)$, on each component of $\boldsymbol{\beta}$, the corresponding MPLE is defined as the minimizer of the penalized objective function given by

(4) $Q_\lambda(\boldsymbol{\beta},\boldsymbol{\theta},\sigma^2) = -\ell(\boldsymbol{\beta},\boldsymbol{\theta},\sigma^2) + \sum_{j=1}^{p} P_\lambda(|\beta_j|).$
With a suitable regularization parameter $\lambda$, the MPLE obtained by the minimization of (4) with respect to all parameters simultaneously selects the important components of $\boldsymbol{\beta}$ (by estimating the unimportant components as exactly zero) and estimates the variance parameters consistently. However, the computation is nontrivial for different penalty functions, and several extensions of basic coordinate descent methods have been proposed for the ultra-high dimensional case. In particular, Schelldorfer, Buhlmann and Van de Geer (2011) have considered the $\ell_1$ penalty in (4) under high-dimensionality, whereas Ghosh and Thoresen (2018) have extended the theory to general nonconcave penalties under both low- and high-dimensional setups. An alternative two-stage approach has been proposed by Fan and Li (2012), which uses a proxy matrix in place of the unknown variance matrix and then maximizes the resulting profile likelihood of $\boldsymbol{\beta}$ only, with suitable penalization, to select the important fixed-effect variables. The problem of estimation and selection of the random-effect variables is considered in a second step.
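As a concrete example of such a nonconcave penalty, the SCAD function of Fan and Li (2001), which we use throughout our numerical work, can be implemented in a few lines; this is an illustrative sketch, with the conventional default for the tuning constant a = 3.7.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty P_lambda(|t|) of Fan and Li (2001), computed elementwise.

    Linear near the origin (lasso-like), quadratic transition, then constant,
    so that large coefficients are not penalized further (reduced bias).
    """
    t = np.abs(np.asarray(t, dtype=float))
    p1 = lam * t                                            # |t| <= lam
    p2 = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))  # lam < |t| <= a*lam
    p3 = lam**2 * (a + 1) / 2                               # |t| > a*lam
    return np.where(t <= lam, p1, np.where(t <= a * lam, p2, p3))
```

The penalized objective then simply adds `scad_penalty(beta, lam).sum()` over the penalized coordinates to the negative log-likelihood.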
Under certain assumptions, all the existing approaches to the MPLE with different penalty functions are shown to satisfy the oracle variable selection consistency, i.e., they recover exactly the true active set (the set of nonzero regression coefficients) with probability tending to one. One crucial assumption is the independence of all the random components, which is referred to as the exogeneity assumption. This refers to the following three types of independence:

Clusters are independent, i.e., $\boldsymbol{b}_i \perp \boldsymbol{b}_j$, $\boldsymbol{\varepsilon}_i \perp \boldsymbol{\varepsilon}_j$ and $\boldsymbol{b}_i \perp \boldsymbol{\varepsilon}_j$ for any two different clusters $i \neq j$.

Covariates are independent of the error terms, i.e., $\boldsymbol{X}_i \perp \boldsymbol{\varepsilon}_i$ for all clusters $i$. (Unit-level exogeneity)

Covariates are independent of the random-effect coefficients, i.e., $\boldsymbol{X}_i \perp \boldsymbol{b}_i$ for all clusters $i$. (Cluster-level exogeneity)
Although (1.) may often hold in most applications through proper design, (2.) and (3.) can be violated due to unknown underlying mechanisms and relationships between the variables under study; we refer to their violations, respectively, as "unit-level endogeneity" or "level-1 endogeneity", and "cluster-level endogeneity" or "level-2 endogeneity".
We start with a numerical illustration of the effect of different types of endogeneity on the MPLE and corresponding variable selection methods. Here, we use the algorithm proposed by Ghosh and Thoresen (2018) with the famous SCAD penalty (Antoniadis and Fan, 2001; Fan and Li, 2001); other existing algorithms also indicate the same behavior of the MPLE under endogeneity and are skipped for brevity. More illustrations are provided in later sections.
Example 2.1. We simulate random samples from the LMM (1) with , for each (so that ), , , and the random effects coefficients having distribution , where . The design matrix has its first column equal to the vector of ones, yielding the intercept, and the next columns are chosen from a multivariate normal distribution with mean and a covariance matrix having th element for all ; the first two columns of correspond to the two random-effect covariates and are kept non-penalized. The true values of the parameters , and are , and for , whereas (correlated covariates) is considered. The regularization tuning parameter is chosen by minimizing the BIC for each replication separately, whereas the parameter in the definition of the SCAD penalty has been kept fixed at . First, we repeat the above simulation 100 times without violating any exogeneity assumption and compute the summary measures of the performance of the MPLE as reported in Table 1. Next, to study the effects of endogeneity, some covariates are made endogenous with either the error () or the th component of the random effect vector , respectively, through the transformations
These produce correlations of and , respectively, for the error or the th random-effect coefficient with the endogenous covariates. The summary performance measures of the resulting MPLE under such endogeneity are reported in Table 1 for (strong correlations of 0.688 and 0.698, respectively) and four particular sets of endogenous covariates: (i) Set 1: , i.e., 10 unimportant covariates are endogenous; (ii) Set 2: , i.e., 10 unimportant covariates and one important fixed-effect covariate are endogenous; (iii) Set 3: , i.e., one important covariate that has both a fixed-effect component and a random-effect slope is endogenous, along with 10 unimportant covariates; and (iv) Set 4: , i.e., all unimportant covariates are endogenous.
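The general mechanism of such endogeneity-inducing transformations can be sketched as follows: each endogenous column is shifted by a multiple of the error term, creating error-covariate correlation. The shift strength `gamma` and the endogenous column set below are hypothetical stand-ins for the elided constants of the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 20
X = rng.standard_normal((n, p))       # exogenous design
eps = rng.standard_normal(n)          # model error
gamma = 0.8                           # hypothetical endogeneity strength

endog_set = [5, 6, 7]                 # hypothetical endogenous columns
X_endog = X.copy()
for j in endog_set:
    # shift each endogenous covariate by a multiple of the error term
    X_endog[:, j] = X[:, j] + gamma * eps

# induced correlation between an endogenous covariate and the error
# (theoretical value gamma / sqrt(1 + gamma^2) ~ 0.62 here)
r = np.corrcoef(X_endog[:, 5], eps)[0, 1]
```

Untouched columns remain essentially uncorrelated with the error, so only the chosen set violates unit-level exogeneity.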
Table 1. Performance of the MPLE under different endogeneity scenarios: endogenous covariate set, average model size, true positives (TP), prediction error (PE), and parameter estimates (Mean, SD and MSE over 100 replications).

No Endogeneity  
None  Mean  6.96  5.00  0.17  1.01  2.05  4.00  3.00  2.99  0.00  0.23  0.36  0.43 
SD  2.91  0.00  0.03  0.37  0.40  0.06  0.07  0.06  0.00  0.04  0.31  0.35  
MSE  –  –  –  0.1335  0.1604  0.0032  0.0047  0.0035  0.0000  0.0018  0.1375  0.1406  
Correlated with error (Level1 endogeneity)  
Set 1  Mean  10.41  5.00  0.03  0.87  1.99  4.00  3.00  2.98  0.00  0.05  0.44  0.35 
SD  1.64  0.00  0.01  0.40  0.37  0.03  0.04  0.03  0.00  0.01  0.31  0.27  
MSE  –  –  –  0.1718  0.1329  0.0011  0.0013  0.0012  0.0000  0.0406  0.1123  0.1144  
Set 2  Mean  10.13  5.00  0.03  0.90  2.02  4.00  2.98  3.03  0.00  0.05  0.35  0.39 
SD  1.95  0.00  0.01  0.34  0.38  0.03  0.03  0.01  0.00  0.01  0.28  0.30  
MSE  –  –  –  0.1222  0.1409  0.0010  0.0013  0.0008  0.0000  0.0420  0.1183  0.1187  
Set 3  Mean  8.45  5.00  0.03  0.86  2.06  3.98  3.00  2.99  0.00  0.05  0.46  0.40 
SD  1.50  0.00  0.01  0.32  0.34  0.03  0.03  0.03  0.00  0.01  0.32  0.28  
MSE  –  –  –  0.1250  0.1201  0.0010  0.0009  0.0009  0.0000  0.0414  0.1138  0.1063  
Set 4  Mean  23.45  5.00  0.01  0.86  2.00  3.99  3.00  2.99  0.00  0.01  0.49  0.39 
SD  4.64  0.00  0.00  0.38  0.34  0.01  0.02  0.02  0.00  0.00  0.35  0.28  
MSE  –  –  –  0.1607  0.1123  0.0003  0.0003  0.0003  0.0000  0.0586  0.1281  0.1100  
Correlated with random intercept (Level2 endogeneity)  
Set 1  Mean  7.28  4.96  0.55  1.00  2.00  3.96  2.96  2.98  0.00  0.67  0.45  0.67 
SD  2.93  0.40  3.84  0.39  0.42  0.40  0.30  0.31  0.00  4.46  0.40  2.07  
MSE  –  –  –  0.1487  0.1781  0.1636  0.0937  0.0929  0.0000  19.8562  0.1677  4.2397  
Set 2  Mean  7.21  5.00  0.17  0.99  2.01  4.00  3.00  3.00  0.00  0.23  0.43  0.44 
SD  2.54  0.00  0.03  0.33  0.32  0.06  0.06  0.01  0.00  0.04  0.37  0.39  
MSE  –  –  –  0.1075  0.1011  0.0039  0.0033  0.0001  0.0000  0.0018  0.1491  0.1672  
Set 3  Mean  5.24  4.64  5.97  0.98  2.08  3.52  2.64  2.64  0.00  6.64  0.51  0.42 
SD  1.63  0.98  15.98  0.63  0.36  1.31  0.98  0.98  0.00  17.61  0.84  0.34  
MSE  –  –  –  0.3983  0.1311  1.9233  1.0834  1.0829  0.0000  347.9638  0.6949  0.1325  
Set 4  Mean  12.47  4.28  7.66  0.98  1.62  3.29  2.46  2.43  0.00  8.80  0.43  4.69 
SD  6.78  1.54  16.27  0.36  0.82  1.55  1.16  1.15  0.00  18.63  0.71  9.74  
MSE  –  –  –  1.2489  3.2745  13.1795  6.9913  7.2183  0.1296  416.7764  0.5136  110.9365  
Correlated with random slope (Level2 endogeneity)  
Set 1  Mean  7.70  5.00  0.17  1.02  1.95  4.00  3.01  2.98  0.00  0.24  0.39  0.42 
SD  3.58  0.00  0.03  0.32  0.38  0.08  0.07  0.06  0.00  0.04  0.31  0.34  
MSE  –  –  –  1.1992  3.9375  15.9995  8.6771  8.9100  0.1296  0.0019  0.1245  0.1353  
Set 2  Mean  7.51  5.00  0.16  0.98  1.96  4.00  2.99  3.00  0.00  0.22  0.43  0.34 
SD  2.63  0.00  0.03  0.40  0.35  0.07  0.06  0.01  0.00  0.04  0.32  0.24  
MSE  –  –  –  1.2623  3.9604  16.0081  8.5702  8.9975  0.1296  0.0020  0.1208  0.1060  
Set 3  Mean  5.28  4.72  4.75  0.85  1.94  3.64  2.73  2.74  0.00  5.31  0.51  0.41 
SD  1.54  0.90  14.70  0.63  0.42  1.15  0.86  0.87  0.00  16.29  0.70  0.34  
MSE  –  –  –  0.4140  0.1764  1.4431  0.8130  0.8138  0.0000  288.3149  0.4839  0.1350  
Set 4  Mean  12.64  4.20  8.69  1.02  1.59  3.20  2.39  2.37  0.00  10.01  0.55  5.62 
SD  7.24  1.61  17.43  0.58  0.85  1.61  1.20  1.19  0.00  19.98  0.73  11.63  
MSE  –  –  –  1.4746  3.2622  12.7758  6.9904  7.0482  0.1296  490.5430  0.5336  159.4848 
The major observations from Table 1 and other similar simulations, not reported here for brevity, can be summarized as follows.

Under endogeneity, we see a significant increase in false positives compared to the ideal exogenous case. The number of such wrongly selected fixed-effect variables further increases with the strength of endogeneity and/or the number of endogenous variables. This effect is more serious for level-1 endogeneity than for level-2 endogeneity.

Under level-1 endogeneity, we are not expected to lose any truly significant fixed-effect variables. But in some cases of level-2 endogeneity, we may lose true positives as well.

The model prediction error is reduced in the presence of level-1 endogeneity, since more variables are selected into the final model. However, for level-2 endogeneity, the model prediction error can increase significantly when we lose the few true positives.

The estimates of the fixed-effect intercept are more affected (increased bias and MSE) by level-1 endogeneity, whereas the estimates of the other fixed effects are affected more by level-2 endogeneity.

The error variance also becomes severely underestimated in the presence of level-1 endogeneity. Level-2 endogeneity has a mixed effect in this case, producing significantly overestimated values of the error variance in some cases with higher degrees of endogeneity.

As is well known, the random-effect variances are generally underestimated even under ideal exogenous conditions. The effect of endogeneity on them is not very clear, but it is always moderate except for level-2 endogeneity with many endogenous variables (Set 4).
In summary, since our motivation is to select the important fixed-effect variables from a large pool of available candidates, the effect of level-1 endogeneity is the more serious one and needs proper treatment to decrease the false positives; on the other hand, level-2 endogeneity needs to be controlled to ensure no loss of true positives.
Having an idea of the effect of different types of endogeneity on the MPLE, we can now investigate further the underlying mechanism of these effects with theoretical justification. In this regard, we will first present a set of necessary conditions for the MPLE under the LMM (1) to be consistent for both estimation and fixedeffects selection. Violation of some of these necessary conditions under endogeneity in turn leads to inconsistency of the MPLE.
Theorem 2.1 (Necessary conditions for variable selection consistency)
Consider a general loss function
(not necessarily the likelihood loss) for the LMM (1) and a general penalty . Further, assume the following.
is twice differentiable with respect to its arguments and the maximum of its second derivatives at the true parameter value is .

There is a local minimizer and of the penalized objective function
which satisfies , and ( is the dimension of ), as .

The penalty function is nonnegative with , is nonincreasing on for some , and .
Then, for any , we have
Proof: Define . Then, by an application of the Karush-Kuhn-Tucker (KKT) condition on the local minimizers and , we get
where if , and if . Therefore, using the monotonicity and the limit of from Condition 3, we get
(5) 
Next, by the first order Taylor series expansion of at around , we get a on the line segment joining and such that
Therefore, on an event having probability tending to one (by Condition 2), we get
where the last step follows by the Cauchy-Schwarz inequality. Now, by Conditions 1 and 2, we get
Then the theorem follows using (5).
Note that Condition 3 in the above theorem, imposed on the penalty function, is the same as the one used in Theorem 2.1 of Fan and Liao (2014) and is quite general. It is satisfied by most common penalties, including the $\ell_1$, SCAD or MCP penalties, by appropriately choosing a regularization parameter going to zero asymptotically ( as ). Therefore, as in Fan and Liao (2014), our result in Theorem 2.1 provides a necessary condition on the loss function for a large class of useful penalty functions. As a special case, our next result proves the inconsistency of the PMLE, in at least one aspect, under an endogenous LMM.
Theorem 2.2 (Inconsistency of penalized MLE)
Consider i.i.d. data and assume there is endogeneity in at least one covariate in at least one group . The loss function of the penalized MLE (PMLE) is given by (3). Assume that and have finite fourth-order moments and that satisfies the condition of Theorem 2.1. If and are local optimizers of the PMLE objective function such that ( is the dimension of ), then either , or .
Proof: The proof follows directly from Theorem 2.1 by noting the fact that, if is endogenous, then for some . This fact follows from a standard calculation of the derivative and an application of the strong law of large numbers; it is hence omitted for brevity.
3 Focused Selection of the Fixed Effect Variables
Consider the linear mixed model setup described in Section 1. We will now propose a new extension of the MPLE of Fan and Li (2012) that leads to consistent oracle selection of the important fixed-effect variables. For this purpose, we use the focused generalized method of moments (FGMM) approach, with nonconcave penalization, as proposed initially by Fan and Liao (2014) in the context of high-dimensional linear regression; the resulting loss function simultaneously performs sparse selection and applies the instrumental variable (IV) method against endogeneity. The IV method basically assumes the availability of a vector of observable instrumental variables which is correlated with the covariates but uncorrelated with the model error. The choice of a proper IV helps us tackle different statistical problems; the instruments are often chosen as a function of the covariates, or even a subset of them, and hence the above condition can easily be verified through some simple moment conditions. As noted earlier, the IV technique has proved extremely useful for addressing endogeneity issues in the classical low-dimensional LMM; see Hall and Horowitz (2005), Wooldridge (2010), Lin et al. (2015) and Chesher and Rosen (2017) for details on popular recent IV methods. Here we extend them to the high-dimensional LMM in the line of Fan and Liao (2014).
3.1 The Profiled Focused GMM (PFGMM) with nonconcave penalization
Under the linear mixed model setup considered in this paper, in keeping with many real-life applications, we have assumed that the number of random effects is small enough for their individual analysis to be possible in the classical sense. Hence, we will assume that the matrix $\boldsymbol{\Psi}_{\boldsymbol{\theta}}$ is positive definite (pd). Let us first assume, for the time being, that the variance parameters are known. Then, based on (3), the likelihood of only the fixed-effects coefficients (ignoring constant terms) is given by

(6) $L_P(\boldsymbol{\beta}) \propto \exp\left\{-\frac{1}{2}(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta})^T\boldsymbol{V}^{-1}(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta})\right\}.$

Note that this is also the profiled likelihood of $\boldsymbol{\beta}$, obtained by substituting the MLE of the random effect vector, given $\boldsymbol{\beta}$, in the joint likelihood of the response and the random effects given the covariates; see Fan and Li (2012) for details. A penalized version of this profile likelihood (in logarithm) can be maximized for sparse selection of the fixed effects and estimation of the corresponding coefficients; Fan and Li (2012) have suggested using a suitable proxy matrix for the unknown variance component matrix.
Now, in the case of endogeneity, we need to additionally apply the IV method to achieve consistency. Let us again assume, for the time being, that the variance parameters, and hence $\boldsymbol{V}$, are known. Let us define the transformed variables $\tilde{\boldsymbol{y}} = \boldsymbol{V}^{-1/2}\boldsymbol{y}$, $\tilde{\boldsymbol{X}} = \boldsymbol{V}^{-1/2}\boldsymbol{X}$ and $\tilde{\boldsymbol{\varepsilon}} = \boldsymbol{V}^{-1/2}(\boldsymbol{Z}\boldsymbol{b}+\boldsymbol{\varepsilon})$. Then we have $\tilde{\boldsymbol{\varepsilon}} \sim N_N(\boldsymbol{0}, \boldsymbol{I}_N)$. Therefore, the profile likelihood of $\boldsymbol{\beta}$, given in (6), under the linear mixed effects model (1) is also the ordinary likelihood of $\boldsymbol{\beta}$ under the following linear regression model in the transformed space:

(7) $\tilde{\boldsymbol{y}} = \tilde{\boldsymbol{X}}\boldsymbol{\beta} + \tilde{\boldsymbol{\varepsilon}}.$
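The whitening behind (7) can be verified numerically. This minimal sketch, with assumed small dimensions and an identity random-effects covariance, checks that the inverse square root of the per-group marginal covariance indeed maps the combined error to identity covariance:

```python
import numpy as np

rng = np.random.default_rng(2)
n_i, q = 5, 2
Z_i = rng.standard_normal((n_i, q))
Psi = np.eye(q)                       # assumed random-effects covariance
sigma2 = 1.0

# per-group marginal covariance V_i = Z_i Psi Z_i' + sigma^2 I
V_i = Z_i @ Psi @ Z_i.T + sigma2 * np.eye(n_i)

# symmetric inverse square root via eigendecomposition (V_i is pd)
w, U = np.linalg.eigh(V_i)
V_inv_half = U @ np.diag(1.0 / np.sqrt(w)) @ U.T

# the whitened error V^{-1/2}(Z_i b_i + eps_i) then has identity covariance:
cov_tilde = V_inv_half @ V_i @ V_inv_half.T
```

Applying the same matrix to the response and design per group gives the transformed regression variables of (7).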
Under level-1 endogeneity in the mixed model (1), we also have endogeneity in the transformed regression (7), with the transformed covariates correlated with the transformed error. Noting that (7) is exactly the model considered by Fan and Liao (2014), our idea is to apply their FGMM approach to this transformed model and then map back to the original space of the data to achieve our goal of fixed-effects selection in the LMM (1) with endogeneity. This would have been straightforward if the original data were independent and identically distributed (i.i.d.) and the variance parameters were known, but neither of these conditions holds in practice. Hence, we need appropriate nontrivial extensions to handle the implementation and the theoretical derivations. Let us start by defining our proposed loss function below.
Note that the components of the transformed error are i.i.d. and those of the transformed response are independent with the same variance but different means. Let us denote the corresponding random variables in the transformed space by and , respectively. Then, obtaining a consistent solution under endogeneity is based on the availability of an appropriate set of observable instrumental variables in the transformed space satisfying the above IV condition. Fan and Liao (2014) achieved variable selection consistency under endogeneity through over-identification by the use of two sets of sieve functions (Chen, 2007), say and , where and are scalar functions. Letting denote the index set of true nonzero coefficients, the above IV condition implies that, for , we have the following set of over-identified equations:
(8) 
Under these conditions, Fan and Liao (2014) have proposed to consider the FGMM loss function
(9)  
(10) 
where for all and is a diagonal weight matrix with nonzero weights corresponding only to the nonzero components of . In particular, the nonzero weights of components can be chosen as the inverse of the estimated variances of and , respectively. Then, a consistent solution of the transformed problem can be obtained by minimizing the penalized FGMM loss function; see Fan and Liao (2014) for details.
Now, let us look back at our original problem of the linear mixed model (1) and map the FGMM loss function (10) back into our data space. Note that, through an inverse transformation, we can assume for some IV in the data space. Hence, the FGMM loss function for our mixed model setup takes the form
(11) 
Note that, in practice with mixed models, we cannot directly minimize this FGMM loss function or its penalized version because it depends on the unknown variance parameters through . To avoid this problem, we follow the approach of Fan and Li (2012) and propose to use in place of , where is some suitable proxy matrix for the unknown variance component matrix . Therefore, we finally minimize, with respect to , the penalized objective function
(12) 
where
(13) 
We will refer to as the profiled focused GMM (PFGMM) loss function, based on its link to the profile likelihood. Had we used the true covariance matrix with known variance parameters, the asymptotic consistency results for the resulting estimator would have followed directly from the results of Fan and Liao (2014). However, we will prove here that, even using the proxy matrix , we can still achieve variable selection consistency under the linear mixed model, provided the proxy matrix is not too far from the truth; we present the rigorous proof, along with the necessary assumptions, in the next subsection.
3.2 Oracle Variable Selection Consistency
Consider the setup of the previous subsection and assume that the true parameter value is the unique solution of the set of overidentified IV equations in (8), where the nonzero component vector . Further, we need the following sets of assumptions.
Assumptions on the penalty (P):
The general penalty function satisfies

is concave and nondecreasing on , with ,

has continuous derivative on , with ,
where denotes the strength of the signal, 
There exists a constant such that , where
(14)
It is worthwhile to note that Conditions (P1)–(P3) are quite standard in high-dimensional analysis and have been used by several authors, including Fan and Liao (2014). They are satisfied by a large class of folded-concave penalties, including with , hard-thresholding, SCAD and MCP, for appropriately chosen tuning parameters. Also, for any , by the concavity of the penalty functions. Condition (P2) is related to the signal strength, on which we need the following additional assumptions depending on the dimension of the problem; these are needed to ensure variable selection consistency and are also satisfied by properly chosen SCAD and MCP penalties for strong signals and small .
Assumptions on the dimension and signal strength (A):

, , .

and
Next, we assume the following conditions on the instrumental variables and , with the notation and for . These are motivated by Fan and Liao (2014), and similar justifications hold for their selection in our setup; see Remark 4.1 there.
Assumptions on the Instruments (I):

There exists such that

Var and Var are bounded away from both zero and infinity uniformly in and .

and are bounded away from zero.

There exists constants such that and , where
(15) 
There exists a constant such that , where
(16)
Note that, Assumptions (I4) and (I5) depend on the choice of proxy matrix . We further need the following additional assumptions regarding .
Assumptions on the Proxy Matrix (M):

There exists a constant such that
and . 
.
Then we have the following main theorem.
Theorem 3.1
Suppose and Assumptions (P), (A), (I) and (M) hold under the linear mixed model setup. Then, there exists a local minimizer of in (12) that satisfies

. In addition, the local minimizer is strict with probability arbitrarily close to one for all sufficiently large .

Assuming denotes the estimated active set, .

For any unit vector ,
where and .
For simplicity in presentation, we have deferred the proof of the above theorem to Appendix A and continue with implementation details and applications of our proposal below.
3.3 Computational Aspects
For implementing the proposed PFGMM, one can follow the same algorithm as used by Fan and Liao (2014) on the transformed variables and . But these transformed variables, leading to the loss function in (11), are not known, so we need to use the proxy matrix and the approximated loss given in (13). So, given a proxy matrix , we first compute the matrix and its square root . For the computation of the matrix square root, we use the blocked Schur algorithm developed by Deadman et al. (2013); implementations can be found in standard statistical software like MATLAB or R (the function named 'sqrtm' in both). Then, the approximations of the transformed variables are computed as and . Following this, the FGMM loss function based on these approximated transformed variables is nothing but the proposed loss in (13), and hence we can next devise an algorithm following the Fan and Liao (2014) approach. The minimization of the resulting penalized PFGMM objective function is done through an iterative coordinate algorithm applied to a smoothed version of the nonsmooth PFGMM minimization problem. More details and justifications can be found in Fan and Liao (2014); for brevity, we only present the crucial considerations regarding the choices of the proxy matrix and the regularization parameter in our context.
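The transformation step just described can be sketched as follows (in Python, using SciPy's Schur-based `sqrtm` as the analogue of the MATLAB/R function mentioned above); the proxy covariance here is a hypothetical scalar multiple of the identity, purely for illustration:

```python
import numpy as np
from scipy.linalg import sqrtm   # Schur-based matrix square root

rng = np.random.default_rng(3)
N, p = 30, 8
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)

# hypothetical proxy covariance: a constant multiple of the identity
c = 2.0
V_proxy = c * np.eye(N)

root = sqrtm(V_proxy).real           # matrix square root of the proxy
T = np.linalg.inv(root)              # V_proxy^{-1/2}
X_tilde, y_tilde = T @ X, T @ y      # approximate transformed variables
```

The approximated transformed variables `X_tilde` and `y_tilde` then feed directly into the FGMM-type loss minimization.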
On the Choice of the Proxy Matrix:
The choice of the proxy matrix is not straightforward from the assumed conditions, but some light can be shed following the discussions in Fan and Li (2012, Section 2.3). In particular, assuming standard nonsingularity conditions involving the random-effects covariates, one possible choice, justifiable for large sample sizes, is a scalar multiple of the identity matrix. We have used this particular proxy matrix for all our empirical illustrations in the present paper.
On the Choice of the Regularization Parameter:
Although, for the regression modeling considered in Fan and Liao (2014), the regularization parameter can be chosen by cross-validation, which can also be used in connection with loss functions other than the likelihood loss (like the FGMM loss), it is not ideal to apply the cross-validation technique in the case of mixed models.
In likelihood-based estimation and variable selection in high or ultrahigh dimensional mixed-effects models, the usual proposal is to choose the regularization parameter corresponding to the minimum BIC (Schelldorfer et al., 2011; Delattre et al., 2014; Ghosh and Thoresen, 2018). When we are using the proposed PFGMM loss function in the ultrahigh dimensional mixed model, one can define a natural extension of BIC based on the estimate obtained through the proposed PFGMM approach with a given regularization parameter. But, in this context, it may be questionable whether such a formulation provides the correct penalty, and this clearly needs further detailed investigation. However, it has been observed that the simple choice suggested in Fan and Liao (2014) works sufficiently well for all our numerical studies.
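A BIC-driven choice of the regularization parameter can be sketched generically as follows. This is only an illustration of the selection loop, not the PFGMM procedure itself: a plain coordinate-descent lasso stands in for the actual PFGMM fit, the BIC form n·log(RSS/n) + df·log(n) is one common variant (the correct penalty for the mixed-model case is, as noted above, an open question), and the baseline value `lam0 = sqrt(log(p)/n)` is an assumed rate-style default, not a formula taken from the paper.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Plain coordinate-descent lasso; a stand-in for the PFGMM fit."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual excluding coordinate j
            r_j = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r_j
            beta[j] = np.sign(z) * max(abs(z) - n * lam, 0.0) / col_ss[j]
    return beta

def bic(X, y, beta):
    """BIC-type criterion: n*log(RSS/n) + df*log(n), df = number of nonzeros."""
    n = len(y)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + np.count_nonzero(beta) * np.log(n)

rng = np.random.default_rng(1)
n, p = 100, 50
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = 2.0                      # sparse strong signal
y = X @ beta_true + rng.normal(size=n)

lam0 = np.sqrt(np.log(p) / n)            # assumed rate-style default
grid = lam0 * np.array([0.25, 0.5, 1.0, 2.0, 4.0])
best = min(grid, key=lambda lam: bic(X, y, lasso_cd(X, y, lam)))
```

The same loop applies with the PFGMM objective substituted for `lasso_cd` and a mixed-model BIC substituted for `bic`.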
Figure caption: Sizes of the estimated active set by the PFGMM method (blue) and the PLS method (blue + orange) for different extents of endogeneity and correlated covariates (standard errors shown by the respective error bars).
3.4 Empirical illustrations
We consider the same simulation setup as in Example 2.1 with level-1 endogeneity and different values of the underlying parameters, and apply the proposed PFGMM algorithm to select the relevant fixed-effects variables. In particular, we consider the true values of as representing strong signal, and the values of the other parameters as and as in Example 2.1. Further, we consider two values of : 0 and , indicating uncorrelated and correlated covariates, respectively, and different values of to represent varying strengths of endogeneity (correlations being 0, 0.1, 0.24, 0.51, 0.69, respectively); note that gives the ideal case with no endogeneity. We have also studied negative values of with the same magnitudes, leading to negative correlations, but their effects are the same as in the positive cases (depending only on the magnitudes) and hence the results are not reported in the paper for brevity. For comparison, we also apply the profile likelihood proposal (referred to here as the PLS) of Fan and Li (2012). The average sizes of the estimated active sets obtained by both methods are presented in Figures 1 and 2, respectively, for the correlated and the independent covariate cases.
These empirical illustrations clearly show the significantly improved performance of the proposed PFGMM method under level-1 endogeneity. In particular, for correlated covariates, even a small amount of endogeneity increases the estimated active set sizes in the PLS method, and this becomes further damaging for larger extents of endogeneity, either through more endogenous variables or stronger endogenous correlations. On the other hand, the proposed PFGMM method produces an active set of size almost equal to the original active set size (5) under any extent of endogeneity, and the variation over different replications is also negligible compared to the PLS method. The results for the independent covariates are similar, although the harmful effect of endogeneity on the PLS method is not significant at weaker levels of endogeneity. The proposed PFGMM method still performs better than the PLS method overall, producing the same sets of active variables in the cases where the PLS also performs well.
The next section describes the performance of the estimated regression coefficients obtained through PFGMM, PLS and further refinements along with proposals for variance component estimation.
4 Estimation of the Variance Parameters
Once we have selected the important fixed-effects variables consistently through the proposed PFGMM algorithm, our problem reduces to a low-dimensional one. Let denote the set of indices of the estimated nonzero coefficients, which is asymptotically the same as the true active set with probability tending to one by Theorem 3.1. So, we now have the reduced model
(17) 
Also, we have an estimate of , which is consistent and asymptotically normal by Theorem 3.1. Then, the most straightforward and intuitive estimates of can be obtained by applying the maximum likelihood method to the resulting residual (random-effect) model
(18) 
We will refer to the resulting estimator as the PFGMM estimator (PFGMME) of , in line with the associated PFGMM estimator of . It is important to note that this PFGMME will also be consistent and asymptotically normal by standard results on likelihood-based inference for the low-dimensional residual model (18). Once the regression estimate has been computed as described in Section 3, the PFGMME of the variance parameters can be computed routinely by using the available software packages for low-dimensional LMMs (e.g., package ‘lme4’ in R, function ‘fitlme’ in MATLAB).
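The second-stage likelihood maximization for the variance parameters can be sketched as follows. This is an illustrative toy, not the paper's estimator: it assumes a single random intercept per group (so the per-group covariance is sig2·I + tau2·J), takes the first-stage residuals as given, and replaces a proper optimizer (or ‘lme4’/‘fitlme’, as recommended above) with a crude grid search; all function names are hypothetical.

```python
import numpy as np

def neg_loglik(params, resid_groups):
    """Gaussian negative log-likelihood (up to a constant) for a
    random-intercept residual model: V_i = sig2 * I + tau2 * J."""
    sig2, tau2 = np.exp(params)          # log-parametrization enforces positivity
    nll = 0.0
    for r in resid_groups:
        n_i = len(r)
        V = sig2 * np.eye(n_i) + tau2 * np.ones((n_i, n_i))
        _, logdet = np.linalg.slogdet(V)
        nll += 0.5 * (logdet + r @ np.linalg.solve(V, r))
    return nll

# simulated first-stage residuals from m groups (true sig2 = 1, tau2 = 0.5)
rng = np.random.default_rng(2)
m, n_i = 40, 5
resid = [rng.normal(size=n_i) + rng.normal(scale=np.sqrt(0.5))
         for _ in range(m)]

# crude grid search over the log-variances; a real implementation would
# use lme4 / fitlme or a numerical optimizer instead
grid = np.log(np.linspace(0.05, 3.0, 30))
best = min(((a, b) for a in grid for b in grid),
           key=lambda q: neg_loglik(np.array(q), resid))
sig2_hat, tau2_hat = np.exp(best)
```

The same likelihood, maximized jointly over the retained fixed effects and the variance parameters, gives the refitted ML/REML estimators discussed next.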
Alternatively, if we want to use the proposed PFGMM only for the selection of the important fixed effects, in the second stage we can also fine-tune the estimates of along with the estimation of to achieve better finite-sample efficiency. For this purpose, we consider the reduced low-dimensional linear mixed-effects model given in (17), containing only the fixed-effects variables selected by the PFGMM algorithm, and apply the standard maximum likelihood (ML) or restricted maximum likelihood (REML) approach to get the new estimates of and of . Let us refer to the resulting estimators