# Inference for high-dimensional linear mixed-effects models: A quasi-likelihood approach

Linear mixed-effects models are widely used in analyzing clustered or repeated measures data. We propose a quasi-likelihood approach for estimation and inference of the unknown parameters in linear mixed-effects models with high-dimensional fixed effects. The proposed method is applicable to general settings where the cluster sizes are possibly large or unbalanced. Regarding the fixed effects, we provide rate optimal estimators and valid inference procedures that are free of the assumptions on the specific structure of the variance components. Separately, rate optimal estimators of the variance components are derived under mild conditions. We prove that, under proper conditions, the convergence rate for estimating the variance components of the random effects does not depend on the accuracy of fixed effects estimation. Computationally, the algorithm involves convex optimization and is loop-free. The proposed method is assessed in various simulation settings and is applied to a real study regarding the associations between the body weight index and polymorphic markers in a heterogeneous stock mice population.

• 15 publications
• 1 publication
• 24 publications
08/30/2022

### Empirical Likelihood Inference of Variance Components in Linear Mixed-Effects Models

Linear mixed-effects models are widely used in analyzing repeated measur...
07/09/2020

### varTestnlme: Variance Components Testing in Linear and Nonlinear Mixed-effects Models

The issue of variance components testing arises naturally when building ...
03/22/2019

### Principal components in linear mixed models with general bulk

We study the outlier eigenvalues and eigenvectors in variance components...
05/14/2018

### An efficient and robust method for analyzing population pharmacokinetic data in genome-wide pharmacogenomic studies: a generalized estimating equation approach

Powerful array-based single-nucleotide polymorphism--typing platforms ha...
02/23/2022

### Adversarially-regularized mixed effects deep learning (ARMED) models for improved interpretability, performance, and generalization on clustered data

Data in the natural sciences frequently violate assumptions of independe...
04/16/2019

### Constraints in Random Effects Age-Period-Cohort Models

Random effects (RE) models have been widely used to study the contextual...
07/16/2020

### Coefficients of Determination for Mixed-Effects Models

In consistency with the law of total variance, the coefficient of determ...

## 1 Introduction

Data with cluster structures commonly arise from many fields, such as biology, genetics, and economics. Linear mixed-effects models provide a flexible tool for analyzing clustered data, which include repeated measures data, longitudinal data, and multilevel data (PB00; Goldstein93)

. The linear mixed-effects models incorporate both the fixed and random effects, where the random effects induce correlations among the observations within each cluster and accommodate the cluster structure. In many modern genomic and economic studies, the dimension of the covariates can be large and possibly much larger than the sample size. A variety of statistical models and approaches have been proposed and studied for analyzing high-dimensional data. However, most of them are restricted to dealing with independent observations, such as linear models and generalized linear models. Statistical inference with respect to high-dimensional linear mixed-effects models remains to be a challenging problem. Motivated by a study of the association between the body weight index (BMI) and polymorphic markers in a stock mice population, we consider estimation and inference of unknown parameters in high-dimensional mixed-effects models.

We first introduce the general setting of a linear mixed-effects model. For simplicity of presentation, we use the setting for clustered data to present our models and methods. For repeated measurement data, the repeated measures form a cluster. Let be the clustering index. For the

-th cluster, we are given a response vector

, a design matrix for the fixed effects , and a design matrix for the random effects , where is the size of the -th cluster. A linear mixed-effects model (LW82) can be typically written as

 yi=Xiβ∗+Ziγi+ϵi, i=1,…,n, (1)

where is the vector of true fixed effects, is the vector of the random effects of the -th cluster, and is the noise vector of the -th cluster. For , we assume and are independently distributed with mean zero and variance and , respectively. Detailed assumptions are given in Sections 2 and 3.

Many existing literatures on linear mixed-effects models assume that the number of the random effects and cluster size are fixed. Without special emphasis, we say a fixed-dimensional setting if , , and are all fixed numbers, and a high-dimensional setting if is large and possibly much larger than , where is the total sample size. We refer to and as random components.

### 1.1 Related literatures

In the fixed-dimensional setting, many methods have been proposed to jointly estimate the fixed effects and variance parameters. We refer to GD11

for a comprehensive review. Among them, the maximum likelihood estimators (MLE) and restricted MLEs are most popular for estimation and inference in linear mixed-effects models. Restricted MLEs can produce unbiased estimators of the variance components in the low-dimensional setting but it is not applicable in the high-dimensional setting. Furthermore, these likelihood-based estimators rely heavily on the normality assumptions of the random components. Computationally, maximizing the likelihood can generally lead to a nonconvex optimization problem which typically has multiple local maxima. Hence, the performance of likelihood-based methods are lack of guarantees in real applications. As an alternative,

SZT07

proposed moment estimators of the fixed effects and variance parameters for a random effect varying-coefficient model.

PY12 considered such moment estimators for fixed-dimensional linear mixed-effects models. Their proposed estimators have closed-form solutions and are computationally efficient. The consistency and asymptotic normality of these estimators are justified under certain conditions in the fixed-dimensional setting. Ahmn12 proposed another moment-based method for the estimation and selection of the variance components of the random effects in the fixed-dimensional setting. This method works especially well when the number of the random effects is as large as the cluster sizes, i.e. .

In terms of statistical inference on the fixed effects, the likelihood ratio, score, and Wald tests are broadly used. For the variance components, the aforementioned three methods (SL94; Lin97; VM03; D04)

are also available. However, when testing the existence of the random effects, the asymptotic distribution of the likelihood ratio is usually a mixture of chi-square distributions

(Miller77; Self87). Since theses methods are based on the MLEs or restricted MLEs as initial estimators, they also suffer from the drawbacks of likelihood-based methods discussed above.

In the high-dimensional setting, the problems are much more challenging. Assuming fixed cluster sizes, SBV10 analyzed the rate of convergence for the global maximizer of the -penalized likelihood with fixed designs. As mentioned before, however, the analysis for the global optimum may not apply to the realizations due to the existence of local maxima. FanLi12 studied the fixed effects and random effects selection in a high-dimensional linear mixed-effects model when cluster sizes are balanced, i.e. . The selection is conducted with concave penalties and minimum signal strength conditions regarding the fixed effects and the random effects are assumed for selection consistency. Bradic17 considered testing a single coefficient of the fixed effects in the high-dimensional linear mixed-effects models with fixed cluster sizes and sub-Gaussian designs. We mention that the theoretical analyses in all three aforementioned papers require the positive definiteness on the covariance matrix of the random effects. This condition takes prior knowledge on the existence of the random effects and can be hard to fulfill in applications. Moreover, the optimal convergence rate of parameter estimation remains unknown. In fact, the proposed estimators in SBV10 and Bradic17 may not be rate-optimal according to our upper and lower bound analysis.

The problems of estimation and inference of the fixed effects in linear mixed-effects models are well connected with the literature on high-dimensional linear models. Many penalized methods have been proposed for prediction, estimation, and variable selection in high-dimensional linear models; see, for example, Lasso; FL01; Zou06; CT07; MB10; Zhang10. Statistical inference on a low-dimensional component of high-dimensional regression coefficients has been considered and studied in linear models and generalized linear models with “debiased” estimators (ZZ14; van14; JM14). The idea of debiasing has also been studied and extended to solve other statistical problems, such as statistical inference in Cox models (Fang17) and simultaneous inference (Zhang17; Dezeure17).

### 1.2 Our contributions

We develop a method for inference for the unknown parameters in high-dimensional linear mixed-effects models with general applicability and computational efficiency. Tthe number of the random effects can be possibly large. The cluster sizes can be either fixed or large, balanced or unbalanced.

We highlight the following advances of the proposed methods in this paper. First, the proposed methods are computationally fast and stable. In fact, they are loop-free and the computations in every step are either analytic or convex. Second, the proposed estimator for the fixed effects is rate optimal from the minimax perspective under general conditions. Procedures are provided for the construction of valid confidence intervals for a single coordinate of the fixed effects. The proposed estimator is free of normality assumptions and free of the structural assumptions on the variance components. Third, we propose to estimate the variance components with any consistent estimators of the fixed effects. The convergence rate of the proposed estimator is shown to be minimax optimal and does not depend on the convergence rate of the fixed effects estimation.

Our analysis provides a novel angle for understanding and simplifying the linear mixed-effects models by approximating the true unknown covariance matrix of the random components with some simple proxy matrices. In this way, one separates the tasks of estimating the fixed effects and variance components and avoids the nuisance parameters in each of the optimizations. This improves computational efficiency and simplifies theoretical analysis.

### 1.3 Notations

Throughout the paper, we use to index the -th cluster and to index the -th observation in each cluster. Let , , , and be obtained by stacking vectors , , , and matrices underneath each other, respectively. Let be a block diagonal matrix with the -th block being . Let denote generic variance parameters. Let and be a block diagonal matrix with the -th block being . Let and , . Let

be the singular value decomposition of

, where are the singular values and and are orthonormal basis. Note that .

For a random variable

, define its sub-Gaussian norm as

 ∥u∥ψ2=supl≥1l−1/2E1/l[|u|l]  (sub-Gaussian norm).

For a random vector , define

 ∥U∥ψ2=sup∥v∥2=1,v∈Rn0∥⟨U,v⟩∥ψ2.

For a matrix , let and

be the upper and lower restricted eigenvalues:

 R+(A,c,S)=sup{∥Aα∥22∥αS∥22: 0≠α∈Rn2, ∥αSc∥1≤c∥αS∥1}. R−(A,c,S)=inf{∥Aα∥22∥αS∥22: 0≠α∈Rn2, ∥αSc∥1≤c∥αS∥1}.

If , means that is semi-positive definite and means that is positive definite. Let and denote the largest and smallest eigenvalues of , respectively. Let , where is the trace of matrix . Let denote a diagonal matrix with the -th diagonal element being .

Let denote some generic positive constants which can vary in different statements.

### 1.4 Organization of the rest of the paper

The rest of the paper is organized as follows. In Section 2, we introduce the proposed quasi-likelihood approach for the fixed effects inference and the main motivations for the current work. In Section 3, we provide theoretical analysis for the proposed inference procedures in Section 2. In Section 4, we introduce our proposal for estimating variance components and provide upper and lower bound analysis. In Section 5, we display the empirical performance of the proposed method in different simulation settings. In Section 6, we apply our proposed methods to a real study regarding the associations between the body weight index and polymorphic markers in a stock mice population. In Section 7, we bring more discussions to the current topic. Proofs are provided in the supplementary files.

## 2 Statistical inference for the fixed effects: the method

In many applications of the linear mixed-effects models, inference of the coefficients of the fixed effects are of main interest. In this section, we present our proposal for the fixed effects inference and describe its motivations. We assume that the fixed effects is sparse such that with unknown. Let denote the support of . We consider model (1) when , , and grow with and can be much larger than . The cluster sizes can be either fixed or grow with .

### 2.1 Motivations of the proposed method

For fixed effects estimation with respect to model (1), the main challenges are posed by the high-dimensionality of the fixed effects and the clustered structure of observations. Before developing a new method, it is helpful to understand how the cluster structure makes model (1) different from a high-dimensional linear model in terms of estimation and inference. Equivalently, we would like to study the consequences of misspecifying a linear mixed-effects model to a linear model. For this purpose, we consider applying an -penalized method for high-dimensional linear models, the Lasso (Lasso), directly to the observations generated from (1). Specifically, define a linear model estimator

 ^β(lm)=argminb∈Rp{12N∥y−Xb∥22+λ(lm)∥b∥1} (2)

for some tuning parameter . The properties of the Lasso have been well-established. In a typical analysis, the convergence rate of concerns the restricted isometries of the sample gram matrix, , the sparsity of the true coefficients, and the so-called “empirical process” part of the problem, . It is known that for linear models with row-wise independent sub-Gaussian , the “empirical process” part is of order , which gives the optimal rate of convergence in -norm. In the following proposition, we study the size of “empirical process” part when the true model is (1).

###### Proposition 2.1

(The rate of the Lasso for linear mixed-effects models). Suppose that the responses are generated with respect to model (1) and each row of is independently generated with covariance matrix conditioning on . Then for any fixed ,

 E[∣∣∣1NX⊤.,j(Zγ+ϵ)∣∣∣2∣∣Z] =σ2eN+(Σx|z)j,j∑ni=1miTr(Ψη∗Σiz)N2 (3)

If is positive definite and all have bounded eigenvalues, then the second term on the right hand side of (2.1) is . If it further holds that , i.e. and are correlated, then the third term can be . That is to say, the Lasso may not be rate optimal for clustered data if either grows, or, grow and and are correlated. On the other hand, if the and ’s are all constant, it is not hard to prove that the original Lasso is still rate optimal for model (1).

Therefore, proper methods need to be developed for high-dimensional linear mixed-effects models under general conditions on and and possibly allowing the correlation between and . For the th block, the covariance of the random components is , which includes unknown parameters. Instead, we define

 Σia=aZi(Zi)⊤+Imi

with some predetermined constant and be the block diagonal matrix with the -th block being . We show that can be viewed as a proxy matrix of , which serves as a weighting matrix in the likelihood function. The following proposition shows that this approximation is indeed valid up to some scaling constant.

###### Proposition 2.2

If is positive definite, then for any constant ,

 min{1σ2e,aΛmax(Ψη∗)}Σ−1a⪯Σ−1θ∗⪯max{1σ2e,aΛmin(Ψη∗)}Σ−1a.

Therefore, if has positive bounded eigenvalues, approximating with for any does not affect the convergence rate but some constants. FanLi12 derived a class of proxy matrices from the profile likelihood, where the same technique is considered in Bradic17. However, the analysis in Bradic17 requires the strictly positive definiteness of , which is hard to fulfill in real applications. We will analyze the proposed method without requiring positive definite .

Our idea is to estimate

using the linearly transformed observations

for . The problem is then to estimate and construct confidence intervals for in the following model,

 ya=Xaβ∗+Σ−1/2a(Zγ+ϵ),

However, even with carefully proposed proxy matrices, statistical inference regarding the fixed effects is a more challenging problem than in the linear models since the transformed data has correlated and heterogeneous noises. As we show in the next section, estimation of can be obtained via a weighted Lasso based on the above model. However, extra efforts are needed to justify the limiting distribution of the estimator for inference and to derive a consistent variance estimator for it. Our proposed variance estimator is adapted from Bu15 and Dezeure17, where the original proposal is for dealing with the misspecified linear models. We will show that the applicability of this type of robust variance estimators in this more challenging set-up.

### 2.2 The quasi-likelihood approach

Let be the block diagonal matrix with the -th block being . Let and denote the transformed observations such that .

First, we estimate the fixed effects via the following weighted Lasso procedure. For some fixed , define

 ^β=argminβ∈Rp{12Tr(Σ−1a)∥ya−Xaβ∥22+λp∑j=1ξj|βj|} (4)

for some tuning parameter and weights , . It reduces to a Lasso estimator if . In Section 5, we show that the proposed method is robust to different choices of via numerical experiments. We will study the convergence rate of and its optimality under proper conditions.

Given the task of making inference of , we propose the following debiased estimator.

 ^β(db)j =^βj+w⊤j(ya−Xa^β)w⊤j(Xa).,j, (5)

where

can be viewed as a correction score. It can be computed via another Lasso regression

(ZZ14; van14) or via a quadratic optimization (ZZ14; JM14). For computational convenience, we consider the Lasso approach. Define the correction score , where

 ^κj=argminκj∈Rp−1{12Tr(Σ−1a)∥∥(Xa).,j−(Xa).,−jκj∥∥22+λj∥κj∥1}, (6)

for some tuning parameter . A two-sided confidence interval for can be constructed as

 ^β(db)j+±zα/2√^Vj, (7)

where is the

-th quantile of a standard normal distribution and

is an estimator of the variance of . In fact, the asymptotic variance of involves nuisance parameters . Without relying on the specific structure of the variance components, we propose to use the following robust estimate:

 ^Vj=∑ni=1[(wij)⊤(yia−Xia^β)]2(w⊤j(Xa).,j)2, (8)

where is the initial Lasso estimator (4), is the -th sub-vector of such that , and is the -th sub-vector of . We will show that under mild conditions, is consistent.

## 3 Statistical inference of the fixed effects: theoretical guarantees

In this section, we move on to provide theoretical guarantees for the procedures described in Section 2.2. We first detail our assumptions.

###### Condition 3.1 (Sub-Gaussian random components)

The random noises , , are i.i.d. from a distribution with mean zero and variance . The random effects , are i.i.d. from a distribution with mean zero and covariance . is in the parameter space

 Θ∗={(η∗,σ2e):0<σ2e≤K1, 0⪯Ψη∗⪯K2Iq}

for some unknown positive constants and . , are i.i.d. sub-Gaussian with . Moreover, and are independent of and .

Classical linear mixed-effects models always assume Gaussian random components (PB00). However, this assumption can be restrictive in real applications. In Condition 3.1, the sub-Gaussian assumption only specifies the tail properties of random components, which is more robust to model misspecifications than classical assumptions. In addition, we do not require to be strictly positive definite.

Regarding the conditions on the designs, the estimation and inference in the linear mixed-effects models are usually conditioning on in order to maintain the cluster structure. SBV10 and FanLi12 assume both and are fixed. Jiang16 considers estimation and inference in a misspecified linear model when both and are random. Bradic17 assumes the design for the fixed effects is sub-Gaussian with mean zero and the design for the random effects is fixed, which implies that and are independent. We comment that the difference between fixed designs and random designs is relatively trivial. However, as suggested by Proposition 2.1, it is important to characterize the correlation between and under general cluster size conditions. In the current work, we assume that is row-wise independent sub-Gaussian conditioning on and the dependence of and is shown by the possibly nonzero .

###### Condition 3.2 (Sub-Gaussian X conditioning on Z)

Conditioning on , each row of is independent sub-Gaussian with covariance matrix such that and

 max1≤i≤n,1≤k≤mi∥∥(Xik,.−E[Xik,.|Z])Σ−1/2x|z∥∥ψ2≤K0.

The assumption of independent sub-Gaussian designs is commonly adopted in high-dimensional regression. In general, it is hard to deal with the correlation between and without assumptions of . In our theorems, we will characterize the effects of this conditional mean on the rates of convergence.

### 3.1 Fixed effects estimation

In this subsection, we analyze the performance of (4) under Conditions 3.1 and 3.2. Define a constant

 Ca=K0√σ2e+K2a. (9)
###### Lemma 3.1 (Fixed effects estimation with weighted Lasso)

For defined in (4), let us take a vector such that and for some positive constant . Assume that Condition 3.1 and 3.2 hold true. If for some large enough , then there exists a large enough such that for

, we have with probability at least

,

 ∥∥^β−β∗∥∥1≤C1sλ,   ∥∥^β−β∗∥∥2≤C2√sλ, and 1Tr(Σ−1a)∥∥Xa(^β−β∗)∥∥22≤C3sλ2, (10)

where are large enough constants.

Lemma 3.1 provides upper bounds for the prediction error and the estimation errors in -norm and -norm. In comparison to the standard results of linear models, a larger tuning parameter is needed for linear mixed-effects model as long as . The results of Lemma 3.1 hold for any such that and any positive constants . Also notice that larger choice of can lead to smaller , which should be favored. However, larger constant can lead to smaller restricted eigenvalues of . From the optimization perspective, can be either set as some fixed nonzero constant or be treated as a tuning parameter in the optimization (4). Lemma 3.1 also requires that the correlation between and is not too large such that .

###### Remark 3.1

If for some fixed , then a sufficient condition for is
.

###### Remark 3.2

For any constant ,

 Tr(Σ−1a)≥N−ac∗ac∗+1nq,

where .

Remark 3.1 is a direct result of Lemma 1.1 in the Supplemental Materials. Together with Remark 3.2, we see that this assumption gets milder as gets smaller. In the next subsection, we study the optimality of the proposed estimator.

### 3.2 Rate optimality of the proposed estimator

We study the optimality of proposed estimator for the fixed effects by restricting ourselves to

 Xik,.|Z∼i.i.d.N(E[Xik,.|Z],Σx|z), γi∼i.i.d.N(0,Ψη∗) and ϵi,k∼i.i.d.N(0,σ2e), (11)

. We consider the following parameter space. Define

 Ξ(s) ={ν=(β∗,θ∗,Σx|z):∥β∗∥0≤s, θ∗∈Θ∗, 1/K∗≤Λmin(Σx|z)≤Λmax(Σx|z)≤K∗<∞, E[X|Z]=0}, (12)

where . We see that (11) and (12) defines a special case of Condition 3.1 and 3.2. We prove the minimax optimal rate of convergence in -norm with respect to .

###### Theorem 3.1

(Minimax lower bounds for estimating the fixed effects). Suppose that (1) and (11) are true.
If for and , then there exists some constant such that for any constant choice of ,

Together with (10), in , is minimax rate optimal in -norm. In the proof, we use the minimax optimality of -penalized MLE, which has as the weighting matrix, and Proposition 2.2 to show the minimax optimality of the proposed estimator.

It has not been addressed so far the optimality of the proposed estimator when is singular. If is singular, where a special case is , we know the optimal rate for estimation in -norm satisfies . By Remark 3.2, the proposed estimator is rate optimal for singular as long as .

### 3.3 Statistical inference of the fixed effects

Debiased estimators can be used for statistical inference of linear combinations of the high-dimensional regression coefficients (ZZ14; van14; JM14). Under certain conditions, the debiased estimators are asymptotically normal and can be used to construct confidence interval with optimal lengths (CG17). Given the task of making inference of , we consider the debiased Lasso estimator proposed in (5). Next theorem provides the asymptotic normality of .

###### Theorem 3.2

(Asymptotic normality of the debiased Lasso estimator). Suppose that is a solution of (6) for some . Then under Conditions 3.1 and 3.2, for defined in (5), if (10) holds, , , and , then it holds that

 w⊤j(Xa).,j∥wj∥22(^β(db)j−β∗j)=R0,j+oP(1),

where for .

Theorem 3.2 demonstrates the asymptotic normality of the debiased Lasso estimator of . Moreover, the results of Theorem 3.2 does not depend on the sparsity condition of the population precision matrix of the design. If one would assume that, some technical conditions can be weakened (Theorem 2.4 of van14). We mention that the condition is for fulfilling the regularity conditions of the central limit theory.

If one has a consistent estimator of , confidence intervals can be constructed based on the limiting distribution of . However, estimating the nuisance parameters requires a knowledge on the structures of variance components and extra efforts. In the next Lemma, we consider a robust estimator of the variance of , defined in (8). The idea is to approximate

 ∑ni=1E[((wij)⊤(Σia)−1/2(Ziγi+ϵi))2|Z](w⊤j(Xa).,j)2

by its empirical version. The proposed variance estimator (8) has been considered in Bu15 and Dezeure17

to deal with model misspecification and heteroscedastic errors in linear models. In the next lemma, we show that it can serve as a convenient and consistent estimate of the unknown variance under mild conditions.

###### Lemma 3.2 (Convergence rate of the variance estimator)

Under Conditions 3.1 and 3.2, if (10) holds true, and , then

 ∣∣ ∣∣[w⊤j(Xa).,j]2∥wj∥22^Vj−V0,j∣∣ ∣∣=OP⎛⎜⎝a1/2n+anslogp+slogp√anTr(Σ−1a)⎞⎟⎠

for defined in (8).

Therefore, if , the proposed variance estimator is consistent under the conditions of Lemma 3.2. The proposed is robust in the sense that it does not rely on the specific structure of the variance components, but it is still consistent under mild conditions.

Based on Theorem 3.2 and Lemma 3.2, hypothesis testing and constructing confidence intervals are both achievable. Especially, we propose to construct confidence intervals according to (7), which have correct coverage probability asymptotically under the conditions of Theorem 3.2 and Lemma 3.2. We will demonstrate the performance of the proposed confidence interval in various simulation settings.

We conclude this section with a further comment on the benefits of using the quasi-likelihood. In fact, even if we have a consistent estimator of , say , using proxy matrix to compute the debiased estimator still has favorable aspects over using . First, using can bring complex dependence structure to as the correction score would also depend on the random components. This makes it difficult to justify the asymptotic normality of . Second, may not approximate well enough in the sense that the magnitude of the error in can be larger than the magnitude of the bias of the debiased Lasso estimator. As a result, the sample size condition for the asymptotic normality may be impaired.

## 4 Estimation of variance components

In this section, we consider estimating the unknown parameters of variance components. Specifically, we consider with linear structures, i.e.

 Ψη∗=d∑j=1η∗jGj∈Rq, (13)

where are symmetric basis matrices which are linearly independent in the sense that

 d∑j=1cjGj=0 iff c1=⋯=cd=0. (14)

The structure of (13) incorporates most commonly used models in applications, such as the random intercept model and the models used in twin or family studies (Wang11). One should note that any symmetric can be represented via (13) with being the vector of its upper diagonal elements.

### 4.1 Estimating variance components

We consider a moment estimator based on the quasi-likelihood function considered in the previous section. We investigate the following moment equation

 E[Σ−1/2a(y−Xβ∗)(y−Xβ∗)⊤Σ−1/2a|Z]=Σ−1/2aΣθ∗Σ−1/2a.

As it is a matrix equality, we use Frobenius norm in the loss function and replace the expectation with empirical realizations:

 g0(θ;a,β∗)=n∑i=1∥∥(Σia)−1/2[(yi−Xiβ∗)(yi−Xiβ∗)⊤−Σiθ](Σia)−1/2∥∥2F.

If is unknown, one can estimate by the minimizer of the above function with replaced by its estimate, say given in (4).

Since is biased, plugging in directly may substantially increase the bias of , which will be proved in the Appendix. In fact, it will be seen in Theorem 4.2 that the minimax optimal rate for estimating is independent of the accuracy of , as long as it is consistent. Our upper bound analysis matches the optimal rate under mild conditions but requires a sample splitting step. The clusters are splitted into two folds such that , , and . Let be an initial estimate of with the second half of the data . We compute the residuals for . Our proposed estimator of is

 ^θ =argminθ∈Θ∗g(θ;{ˆresi}i∈I1), (15)

where

 g(θ;{ˆresi}i∈I1)=∑i∈I1∥∥(Σia)−1/2[ˆresiˆres⊤i−Σiθ](Σia)−1/2∥∥2F. (16)

In practice, one can run another round of (15) with samples in the two folds switched and estimate with the average of two estimates. Computationally, (15) involves a convex optimization, which is loop-free and stable. The main ideas of our proposal are illustrated in the coming subsection.

### 4.2 Upper bound analysis

In this subsection, we analyze the proposed estimator (15) for estimating variance components. We assume the following technical condition.

###### Condition 4.1 (Positive definite Hessian matrix)

We assume is fixed and . Let be such that

 (17)

For , let is a version of which is computed only based on clusters in . It holds that for some constant .

In fact, is the Hessian matrix of the function (16) which is free of true parameters. We see that will be the Hessian matrix for the MLE at if is replaced by .

We comment on the last statement of Condition 4.1. We note that for any fixed dimension positive definite matrix and , there exists a positive constant such that . However, can be much larger than the other diagonal elements if . Loosely speaking, the last statement of Condition 4.1 is to assure that the eigenvalues of have the same order of magnitude as the diagonal elements of . This condition allows us to evaluate the concentration of correlated quadratic forms (i.e. the score function) separately.

###### Theorem 4.1

(Convergence rate for estimating variance components.) Let be computed via (15). Assume that . Under Conditions 3.1, 3.2, and 4.1, if , then

 ∥^η−η∗∥2=OP(max1≤j≤d(B(1)a)−1/2j,j) and (18) |^σ2e−σ2e|=OP((B(1)a)−1/2d+1,d+1+∥^u(2)∥22). (19)

We see that the convergence rate of is dominated by the parametric rate while the error of the fixed effects estimation does not play a role. In fact, our proof shows that the convergence rate of is same as the rate when is known, or equivalently, having a random effects model. On the other hand, the convergence rate of depends on the accuracy of fixed effect estimation. This can be roughly understood from the following moment equation. For now, let assume that is row-wise independent Gaussian with mean zero and covariance conditioning on . Then

 E[yiy⊤i|Z]=((β∗)⊤Σx|zβ∗+σ2e)Imi+d∑j=1η∗jZiGj(Zi)⊤.

Since is linearly independent of for , the error of the fixed effects estimation will affect the accuracy of estimating but not . For realized via (4), under the conditions of Lemma 3.1, we have .

Another condition required in Theorem 4.1 is which suggests weak dependence between two design matrices in real applications. In the simulation experiments presented in Section 5, we will carefully examine the effects of the correlation between and on the accuracy of statistical inference.

### 4.3 Rate optimality of estimating variance components

Now we turn to study the minimax lower bound for estimating variance parameters . We consider Gaussian designs and random components (11) and parameter space (12).

###### Theorem 4.2

(Minimax lower bounds for estimation of variance components.) Suppose that (1) and (11) are true. If for and for some , then there exists some constants such that

 inf