 # Estimation of marginal model with subgroup auxiliary information

The marginal model is a popular tool for studying longitudinal and clustered data. This paper investigates estimation of the marginal model with subgroup auxiliary information. We propose a new type of auxiliary information for the marginal model and combine it with the traditional estimating equations of the quadratic inference function (QIF) method through the generalized method of moments (GMM), thus obtaining a more efficient estimator. The asymptotic normality of the proposed estimator and the associated test statistics are established. The theoretical results show that the estimator with subgroup information is more efficient than the conventional QIF estimator. Simulation studies are carried out to examine the finite-sample performance of the proposed method, and we apply the method to a real data set for illustration.

## Authors


## 1 Introduction

Longitudinal and clustered data arise in many fields, such as biomedicine and economics. For longitudinal data, the unknown correlation among repeated measurements of the same subject complicates the analysis. If the within-subject correlation is ignored and all observations are treated as independent, the resulting inference may be inaccurate. As an extension of generalized linear models (Nelder and Wedderburn, 1972; McCullagh and Nelder, 1989), Liang and Zeger (1986) proposed a marginal model, which only makes assumptions on the conditional expectation and variance of each component of the response given the covariates, without specifying the correlation structure. They suggested estimating the parameters of this model by the generalized estimating equations (GEE) method. The score-type estimating equations of the GEE method are built on a working correlation matrix, that is, an assumed conditional correlation structure among the components of the response vector.

The working correlation matrix usually contains unknown nuisance parameters, which also need to be estimated within the GEE procedure. When the working correlation matrix is misspecified, the GEE estimator is still consistent but may lose efficiency. Qu et al. (2000) introduced a more efficient estimator obtained by the quadratic inference function (QIF) method. They approximated the inverse of the working correlation matrix by a linear combination of known basis matrices and constructed a quadratic inference function based on those matrices. The minimizer of this quadratic function is the QIF estimator. Since the nuisance parameters of the working correlation matrix do not enter the quadratic function, the QIF estimator still performs well in terms of efficiency when the working correlation is misspecified. The QIF method has been widely used in many models. Qu and Li (2006) studied the estimation of the varying coefficient model with the QIF method. Li and Yin (2009) applied the QIF method to the accelerated failure time model with multivariate failure times. Li et al. (2016) investigated the QIF estimator of the marginal additive hazards model with clustered failure times.

Recently, how to use information from other sources to improve the efficiency of statistical inference has become a research focus, especially the combination of individual-level data with summary-level information obtained from other studies. One popular approach is the auxiliary information method, where auxiliary information refers to information of a specific form provided by other data sets. For example, using covariate-specific disease prevalence information, that is, the conditional probability of disease under different levels of a covariate, as auxiliary information, Qin et al. (2015) obtained a more efficient estimator for the logistic regression model in case-control studies. Chatterjee et al. (2016) developed auxiliary information methods for regression models. They calculated an efficient likelihood estimator of the regression parameter by incorporating summary-level information from external big data into the likelihood function, and extended their method to the case where the distribution of the covariates in the internal data differs from that in the external data. Huang et al. (2016) proposed a double empirical likelihood estimator of the regression parameter in Cox’s proportional hazards model, which uses subgroup survival probabilities at a fixed year as auxiliary information. Compared with the conventional partial likelihood estimator, the efficiency of the double empirical likelihood estimator is improved significantly by the subgroup information.

In this paper, to improve the efficiency of the parameter estimator in the marginal model, we add a new type of auxiliary information to the estimation procedure based on the GMM method (Hansen, 1982). Unlike previous research on auxiliary information in regression models, we construct the estimator with auxiliary information from estimating equations rather than from the likelihood function. In addition, previous studies mainly focus on the case of a one-dimensional independent response, whereas we address the multivariate correlated case directly.

The rest of the article is organized as follows. In Section 2, we introduce the main results of this paper, which include the marginal model and its properties, the proposed auxiliary information, and the estimation procedure with auxiliary information by the GMM method. The asymptotic properties of the resulting estimator are also presented. Simulation studies are reported in Section 3, and we illustrate the proposed procedures with a real data example in Section 4. A brief discussion is given in Section 5. The proofs of the theorems are in the Appendix.

## 2 Main Results

### 2.1 Marginal Model and Auxiliary Information

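In this paper, we consider only longitudinal data; the estimation procedure for clustered data is similar. For $i=1,\dots,n$ and $j=1,\dots,q$, let $\mathbf{Y}_i=(Y_{i1},\dots,Y_{iq})^T$ be the response vector of the $i$th subject and $\mathbf{X}_{ij}$ be the $j$th observation of the $p$-dimensional covariate of the $i$th subject, so that $\mathbf{X}_i=(\mathbf{X}_{i1},\dots,\mathbf{X}_{iq})^T$ is the covariate matrix of the $i$th subject. We assume that observations from different subjects are independent. The marginal model takes the form

$$h(\mu_{ij})=\mathbf{X}_{ij}^{T}\beta,\quad \nu_{ij}=\psi\, v(\mu_{ij}),\quad i=1,\dots,n;\ j=1,\dots,q,$$

where $\mu_{ij}=E(Y_{ij}\mid\mathbf{X}_{ij})$, $\nu_{ij}=\mathrm{Var}(Y_{ij}\mid\mathbf{X}_{ij})$, and $\beta$ is the $p$-dimensional parameter vector of interest. In addition, $\psi$ is the scale parameter, and $h(\cdot)$ and $v(\cdot)$ are known link and variance functions.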

Qu et al. (2000) proposed estimating $\beta$ by the QIF method. They expressed the inverse of the working correlation matrix as

$$\mathbf{R}^{-1}(\alpha)=\sum_{l=1}^{L}\alpha_l\mathbf{M}_l,$$

where $\alpha=(\alpha_1,\dots,\alpha_L)^T$ is the nuisance parameter vector and $\mathbf{M}_1,\dots,\mathbf{M}_L$ are known basis matrices. Based on those basis matrices, they obtained the following estimating equation

 (1)

where is the partial derivative of with respect to , is the conditional mean vector and is a diagonal matrix with each entry as the marginal conditional variance, . The QIF estimator of is calculated by minimizing the quadratic inference function

where .

For the marginal model, we suggest a new type of auxiliary information. Let $\Omega_1,\dots,\Omega_K$ be a partition of the range space of the covariate $\mathbf{X}$. If the conditional expectations of the response in the subgroups are available, we can treat them as auxiliary information. Specifically, the auxiliary information takes the form

$$E(\mathbf{Y}\mid\mathbf{X}\in\Omega_k)=\phi_k,\quad k=1,\dots,K.$$

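We now rewrite the auxiliary information as estimating equations. By the law of iterated expectations, $\phi_k$ satisfies

$$E\big[I(\mathbf{X}\in\Omega_k)\{E(\mathbf{Y}\mid\mathbf{X})-\phi_k\}\big]=0,\quad k=1,\dots,K. \tag{2}$$

Define $\Psi_k(\beta,\mathbf{X})=I(\mathbf{X}\in\Omega_k)\{E(\mathbf{Y}\mid\mathbf{X})-\phi_k\}$, where $E(\mathbf{Y}\mid\mathbf{X})$ depends on $\beta$ through the marginal model. Then (2) is equivalent to

$$E\{\Psi_k(\beta,\mathbf{X})\}=0,\quad k=1,\dots,K. \tag{3}$$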

In the following, we incorporate equation (3) into the estimation procedure.
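The auxiliary estimating function can be sketched in a few lines. The example below is a minimal illustration, assuming an identity link (so that the conditional mean is the covariate matrix times $\beta$); the function name `aux_moments`, the toy data and the parameter values are hypothetical and not taken from the paper. The subgroup mean `phi` is computed from the toy sample itself only so that the moment condition can be checked exactly.

```python
import numpy as np

def aux_moments(beta, X, in_group, phi):
    """Psi_k(beta, X_i) = I(X_i in Omega_k) * (mu_i(beta) - phi_k) for one subgroup.

    X: (n, q, p) covariates, in_group: (n,) boolean membership in Omega_k,
    phi: (q,) subgroup mean of the response. Identity link is an assumption.
    """
    mu = X @ beta                          # (n, q) conditional means
    return in_group[:, None] * (mu - phi)  # rows outside Omega_k are zero

# Hypothetical toy data: n subjects, q repeated measures, p covariates.
rng = np.random.default_rng(0)
n, q, p = 2000, 3, 2
X = rng.normal(size=(n, q, p))
beta0 = np.array([1.0, -0.5])
in_group = X[:, 0, 0] >= 0                 # subgroup defined by one covariate value

# Subgroup mean of the conditional mean, taken from the same sample for checking.
phi = (X @ beta0)[in_group].mean(axis=0)

psi_bar = aux_moments(beta0, X, in_group, phi).mean(axis=0)
print(psi_bar)  # sample analogue of E{Psi_k} at the true beta
```

In practice `phi` would come from an external source, and the sample average would be close to zero at the true parameter rather than exactly zero.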

### 2.2 GMM Estimator with Auxiliary Information

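The sample analogue of (3) is

$$\frac{1}{n}\sum_{i=1}^{n}\Psi_k(\beta,\mathbf{X}_i)=0,\quad k=1,\dots,K. \tag{4}$$

Combining the estimating equations (4) with (1), we have

$$\mathbf{g}_n(\beta)=\frac{1}{n}\sum_{i=1}^{n}\mathbf{g}(\beta,\mathbf{X}_i), \tag{5}$$

where $\mathbf{g}(\beta,\mathbf{X}_i)$ stacks the $i$th subject's contributions to the estimating functions in (1) and (4).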

The number of estimating equations in (5) is larger than the dimension of the parameter vector $\beta$. As stated in Hansen (1982), instead of solving the estimating equations directly, we estimate $\beta$ by minimizing the following quadratic function

$$Q_n(\beta)=\mathbf{g}_n(\beta)^{T}\{\Sigma_n(\beta)\}^{-1}\mathbf{g}_n(\beta),$$

where $\Sigma_n(\beta)$ is the sample covariance matrix of the estimating functions $\mathbf{g}(\beta,\mathbf{X}_i)$. The minimizer of $Q_n(\beta)$, denoted by $\hat\beta_{GMM}$, can be computed by the Newton-Raphson iterative algorithm.
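The following sketch illustrates minimizing a GMM quadratic function when there are more estimating equations than parameters. It is a simplified toy, not the paper's procedure: it assumes a scalar response with a single slope, takes $\Sigma_n$ to be the sample second-moment matrix of the moment contributions, and uses a grid search in place of Newton-Raphson. The names `gmm_objective` and `moments` and all numerical settings are hypothetical.

```python
import numpy as np

def gmm_objective(g):
    """Q_n = g_bar^T Sigma_n^{-1} g_bar for an (n, m) array of per-subject moments."""
    g_bar = g.mean(axis=0)
    sigma_n = g.T @ g / g.shape[0]   # assumed form: sample second-moment matrix
    return float(g_bar @ np.linalg.solve(sigma_n, g_bar))

def moments(beta, x, y, phi_pos, phi_neg):
    """Toy moments: one score-type equation plus two subgroup (auxiliary) equations."""
    mu = beta * x
    return np.column_stack([
        x * (y - mu),                # score-type equation
        (x >= 0) * (mu - phi_pos),   # auxiliary equation for subgroup x >= 0
        (x < 0) * (mu - phi_neg),    # auxiliary equation for subgroup x < 0
    ])

rng = np.random.default_rng(1)
n, beta_true = 2000, 1.0
x = rng.normal(size=n)
y = beta_true * x + rng.normal(size=n)

# Subgroup means of Y, known in closed form for standard normal x.
phi_pos = beta_true * np.sqrt(2 / np.pi)
phi_neg = -phi_pos

# Three equations, one parameter: minimize Q_n over a grid instead of solving g_n = 0.
grid = np.linspace(0.0, 2.0, 2001)
q_values = np.array([gmm_objective(moments(b, x, y, phi_pos, phi_neg)) for b in grid])
beta_hat = grid[q_values.argmin()]
print(beta_hat)
```

A grid search is used only because the toy problem is one-dimensional; for a vector parameter, a Newton-type iteration as mentioned in the text is the natural choice.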

### 2.3 Large sample properties

We present the large sample properties of the proposed estimation method. Throughout, “$\stackrel{D}{\longrightarrow}$” denotes convergence in distribution.

###### Theorem 1

Under Conditions C1–C5 in the Appendix, we have that

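$$n^{1/2}(\hat\beta_{GMM}-\beta_0)\stackrel{D}{\longrightarrow}N_p\big\{0,(\mathbf{B}_1^{T}\Sigma_1^{-1}\mathbf{B}_1+\mathbf{B}_2^{T}\Sigma_2^{-1}\mathbf{B}_2)^{-1}\big\},$$

where $\beta_0$ is the true value of the parameter vector $\beta$, and the definitions of $\mathbf{B}_1$, $\mathbf{B}_2$, $\Sigma_1$ and $\Sigma_2$ are given in the Appendix.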

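From the proof of Theorem 1, the asymptotic variance of the QIF estimator is $(\mathbf{B}_1^{T}\Sigma_1^{-1}\mathbf{B}_1)^{-1}$. Since $\mathbf{B}_2^{T}\Sigma_2^{-1}\mathbf{B}_2$ is positive semidefinite,

$$(\mathbf{B}_1^{T}\Sigma_1^{-1}\mathbf{B}_1+\mathbf{B}_2^{T}\Sigma_2^{-1}\mathbf{B}_2)^{-1}\le(\mathbf{B}_1^{T}\Sigma_1^{-1}\mathbf{B}_1)^{-1},$$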

the new estimator with auxiliary information is asymptotically more efficient than the QIF one.
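The matrix inequality above can be checked numerically. The sketch below draws arbitrary matrices in the roles of $\mathbf{B}_1$, $\mathbf{B}_2$, $\Sigma_1$ and $\Sigma_2$ (random placeholders, not the quantities defined in the Appendix) and verifies that the difference of the two variance matrices has no negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3  # dimension of beta (illustrative)

def random_pd(k):
    """Random symmetric positive definite k x k matrix."""
    a = rng.normal(size=(k, k))
    return a @ a.T + k * np.eye(k)

# Placeholder matrices playing the roles of B_1, Sigma_1, B_2, Sigma_2.
B1, S1 = rng.normal(size=(6, p)), random_pd(6)
B2, S2 = rng.normal(size=(4, p)), random_pd(4)

info_qif = B1.T @ np.linalg.solve(S1, B1)   # B_1^T Sigma_1^{-1} B_1
info_aux = B2.T @ np.linalg.solve(S2, B2)   # B_2^T Sigma_2^{-1} B_2

var_qif = np.linalg.inv(info_qif)
var_gmm = np.linalg.inv(info_qif + info_aux)

# The inequality holds if var_qif - var_gmm is positive semidefinite.
diff = var_qif - var_gmm
eigenvalues = np.linalg.eigvalsh((diff + diff.T) / 2)
print(eigenvalues)
```

All eigenvalues are nonnegative up to rounding, so every linear combination of the parameters has an asymptotic variance that is no larger with the auxiliary information than without it.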

To carry out statistical inference on $\beta$ in the marginal model, we construct a test statistic based on the quadratic inference function. Suppose that the parameter vector $\beta$ can be decomposed into two subvectors, a subvector of interest and a nuisance subvector. A hypothesis that fixes the subvector of interest at a given value can then be tested using the following result, treating the other subvector as a nuisance parameter vector.

###### Theorem 2

Let and be the GMM estimator of with auxiliary information when is fixed at . Under Conditions C1–C5 in the Appendix, we have .

The proofs of Theorems 1 and 2 are briefly outlined in the Appendix.

## 3 Simulation Studies

In this section, we conduct a series of simulation studies to examine the finite-sample performance of the proposed method. We consider the following marginal model,

$$E(Y_{ij}\mid X_{ij1}=x_{ij1},X_{i2}=x_{i2})=\beta_1x_{ij1}+\beta_2x_{i2},\quad i=1,\dots,n;\ j=1,2,3. \tag{6}$$

The covariate $\mathbf{X}_1$ is generated from a multivariate normal distribution, and the covariate $X_2$ is simulated from a Bernoulli distribution taking the value 0 or 1. The response vector is generated from a multivariate normal distribution whose mean vector is given by model (6).

In each case, we estimate $\beta_1$ and $\beta_2$ by the QIF method and by the GMM method with auxiliary information. For each setting, we report the bias (Bias), the standard deviation (SD), the standard error (SE) and the empirical coverage probability (CP) over the simulation replications.

To examine the influence of the working correlation matrix on the proposed method, we consider two common structures for both the true correlation matrix of the response and the working correlation matrix: the compound symmetry (CS) structure and the first-order autoregressive (AR(1)) structure. The covariate $\mathbf{X}_1$ is assumed to have a CS covariance structure, that is, a linear combination of the identity matrix and the outer product of the vector of ones. The inverse of a working correlation matrix with CS structure can be written as a linear combination of the identity matrix and a basis matrix with 0 on the diagonal and 1 off the diagonal, with coefficients that depend on a nuisance parameter. Under the AR(1) assumption, the inverse of the working correlation matrix can be approximated by a linear combination of the identity matrix and a basis matrix with 1 on the two main off-diagonals and 0 elsewhere, omitting an unimportant matrix with 1 at the two corner diagonal positions and 0 elsewhere. We then divide all the subjects into two subgroups by the value of the covariate $X_2$, which have the following form,

$$\Omega_1^{*}=\{X_2:X_2=1\},\quad \Omega_2^{*}=\{X_2:X_2=0\}.$$

By the property of multivariate normal distribution, . Substituting by its true value, the auxiliary information of the two groups are and , respectively. Choosing the sample size , and , the simulation results are presented in Table 1.
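The basis-matrix representations of the CS and AR(1) inverse correlation matrices used in this section can be verified numerically. The sketch below is illustrative only: the dimension `q = 3` matches model (6), but the value `rho = 0.5`, the helper name `basis_fit` and the least-squares check are assumptions made for the example.

```python
import numpy as np

def basis_fit(R, bases):
    """Least-squares coefficients a_l with R^{-1} ~ sum_l a_l M_l, and the max abs error."""
    r_inv = np.linalg.inv(R).ravel()
    A = np.column_stack([M.ravel() for M in bases])
    coef, *_ = np.linalg.lstsq(A, r_inv, rcond=None)
    return coef, np.abs(A @ coef - r_inv).max()

q, rho = 3, 0.5                                    # illustrative values
idx = np.arange(q)
lag = np.abs(idx[:, None] - idx[None, :])

I = np.eye(q)
M_cs = np.ones((q, q)) - I                         # 0 on the diagonal, 1 off the diagonal
M_ar = (lag == 1).astype(float)                    # 1 on the two main off-diagonals
M_corner = np.zeros((q, q))
M_corner[0, 0] = M_corner[-1, -1] = 1.0            # 1 at the two corner positions

R_cs = (1 - rho) * I + rho * np.ones((q, q))       # compound symmetry
R_ar = rho ** lag                                  # AR(1)

coef_cs, err_cs = basis_fit(R_cs, [I, M_cs])               # exact for CS
coef_ar3, err_ar3 = basis_fit(R_ar, [I, M_ar, M_corner])   # exact for AR(1)
coef_ar2, err_ar2 = basis_fit(R_ar, [I, M_ar])             # approximation without corners
print(err_cs, err_ar3, err_ar2)
```

With all three AR(1) basis matrices the representation is exact, with coefficients $(1+\rho^2)/(1-\rho^2)$, $-\rho/(1-\rho^2)$ and $-\rho^2/(1-\rho^2)$; dropping the corner matrix leaves a nonzero approximation error.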

In Table 1, “GMMAI” denotes the estimator obtained by the GMM method with the auxiliary information from $\Omega_1^{*}$ and $\Omega_2^{*}$. The results show that both the QIF method and the GMM method with auxiliary information perform well: the biases are very small, the SDs are close to the SEs computed from the asymptotic variance formula, and the CPs generally match the nominal level. Both estimators are slightly more efficient when the working correlation matrix is correctly specified than when it is misspecified, but the difference is not significant. Furthermore, the GMMAI estimates of $\beta_1$ are nearly the same as the QIF ones, because $\Omega_1^{*}$ and $\Omega_2^{*}$ only involve information about the covariate $X_2$. In contrast, the results for $\beta_2$ differ markedly between the two methods: in all cases the SDs of the GMMAI estimator of $\beta_2$ are only a fraction of those of the QIF estimator, which shows that incorporating the auxiliary information can greatly improve estimation efficiency. Since correct specification of the correlation matrix has little influence on the results, we consider only correctly specified cases in the following simulation studies.

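We now study the effect of auxiliary information on estimation efficiency in more detail. We use the values of $X_{11}$ as well as $X_2$ when grouping the subjects. The resulting subgroups are

$$\begin{aligned}
\Omega_1&=\{(\mathbf{X}_1,X_2)\mid X_{11}\ge 0,\ X_2=1\}, & \Omega_2&=\{(\mathbf{X}_1,X_2)\mid X_{11}<0,\ X_2=1\},\\
\Omega_3&=\{(\mathbf{X}_1,X_2)\mid X_{11}\ge 0,\ X_2=0\}, & \Omega_4&=\{(\mathbf{X}_1,X_2)\mid X_{11}<0,\ X_2=0\}.
\end{aligned}$$

To estimate the auxiliary information, we calculate the mean of $\mathbf{Y}$ in each subgroup and denote the auxiliary information by $\phi_1$, $\phi_2$, $\phi_3$ and $\phi_4$. Merging $\Omega_1$ with $\Omega_2$ and $\Omega_3$ with $\Omega_4$ reduces the four subgroups to $\Omega_1^{*}$ and $\Omega_2^{*}$.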
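Estimating the auxiliary information by subgroup means can be sketched as follows. The data-generating values below (sample size, coefficients, independent unit-variance noise, Bernoulli probability 0.5) are hypothetical choices for illustration and are not the settings used in the paper; `subgroup_means` is likewise a hypothetical helper.

```python
import numpy as np

def subgroup_means(Y, x11, x2):
    """Mean response vector in each of the four subgroups Omega_1, ..., Omega_4."""
    masks = [
        (x11 >= 0) & (x2 == 1),   # Omega_1
        (x11 < 0) & (x2 == 1),    # Omega_2
        (x11 >= 0) & (x2 == 0),   # Omega_3
        (x11 < 0) & (x2 == 0),    # Omega_4
    ]
    return np.stack([Y[m].mean(axis=0) for m in masks]), masks

# Hypothetical data loosely following model (6) with q = 3 repeated measures.
rng = np.random.default_rng(2)
n, q = 4000, 3
X1 = rng.normal(size=(n, q))                 # time-varying covariate
x2 = rng.integers(0, 2, size=n)              # binary subject-level covariate
beta1, beta2 = 1.0, 1.0                      # illustrative coefficients
Y = beta1 * X1 + beta2 * x2[:, None] + rng.normal(size=(n, q))

phi, masks = subgroup_means(Y, X1[:, 0], x2)

# Merging Omega_1 and Omega_2 gives the subgroup {X2 = 1}: its mean is the
# size-weighted average of phi_1 and phi_2.
n1, n2 = masks[0].sum(), masks[1].sum()
phi_star_1 = (n1 * phi[0] + n2 * phi[1]) / (n1 + n2)
print(phi.round(2))
```

The weighted average reproduces the mean of the merged subgroup exactly, which is why the four-subgroup information contains the two-subgroup information as a special case.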

First, we consider a simple case in which the correlation structures are CS and AR(1). We estimate the parameters in model (6) by three methods: QIF, GMMAI2 and GMMAI4, where “GMMAI2” denotes the GMM estimator with the auxiliary information from $\Omega_1^{*}$ and $\Omega_2^{*}$, and “GMMAI4” denotes the GMM estimator with the subgroup information from $\Omega_1,\dots,\Omega_4$. The simulation results are shown in Table 2. The results for GMMAI2 in Table 2 are similar to those in Table 1: only the efficiency of the estimator of $\beta_2$ is improved when we incorporate the auxiliary information from $\Omega_1^{*}$ and $\Omega_2^{*}$. However, when we incorporate $\Omega_1,\dots,\Omega_4$ into the estimation procedure, the efficiency of the estimator of $\beta_1$ is also improved, and at the same time the SD of the estimator of $\beta_2$ is smaller than that of the GMMAI2 method. Once again, these results show that using auxiliary information effectively can improve the efficiency of the parameter estimators.

In the above simulations, only the first component of $\mathbf{X}_1$ is used to form groups. In practice, it is usually very difficult to obtain information related to all the components of the covariate $\mathbf{X}_1$. Thus, we conduct another simulation study to explore how the correlation among the components of $\mathbf{X}_1$ affects the extent of the improvement in estimation efficiency when the auxiliary information is related to only some components of $\mathbf{X}_1$. We estimate the parameters in model (6) by the QIF, GMMAI2 and GMMAI4 methods, with a CS structure for $\mathbf{X}_1$ and with CS and AR(1) structures for the response. Besides the Bias, SD, SE and CP, we also calculate the relative efficiency (RE) of the estimated coefficients, defined as the ratio of the variance of the QIF estimator to that of the GMM estimator with subgroup information. The results are summarized in Table 3. The table shows that the RE of $\beta_1$ under the GMMAI4 method increases with the correlation parameter of $\mathbf{X}_1$. Indeed, the larger this parameter is, the more highly correlated the components of $\mathbf{X}_1$ are. In this case, even though the auxiliary information is connected only to $X_{11}$, it also carries information about the other two components of $\mathbf{X}_1$. Thus, the efficiency of the estimator is improved to a larger extent.

Finally, we study the impact of the auxiliary information on the power of hypothesis tests about the parameters $\beta_1$ and $\beta_2$ in model (6), using CS correlation structures. We calculate the type I errors and the test power by the QIF, GMMAI2 and GMMAI4 methods, respectively. Table 4 presents the simulation results. The table shows that all type I errors are close to the nominal value, which indicates that all testing methods perform well. When the subgroup information from $\Omega_1^{*}$ and $\Omega_2^{*}$ is incorporated, the power of the test on $\beta_2$ is considerably higher than that of the QIF test. Moreover, if we use all the auxiliary information in the testing procedure, the power of the tests on $\beta_1$ and $\beta_2$ is improved at the same time. The results show that the auxiliary information can improve not only the efficiency of parameter estimation, but also the power of hypothesis tests.

To examine the conclusion of Theorem 2, we draw QQ-plots of the test statistics under the null hypotheses for the QIF, GMMAI2 and GMMAI4 methods, which are presented in Figures 1 and 2. In these figures, the sample quantiles show a linear relationship with the theoretical ones, which is consistent with the theoretical conclusion in Theorem 2.

## 4 Real Data Analysis

As an illustration, we apply the proposed methods to the Early Childhood Longitudinal Study, Kindergarten Class of 1998-99 (ECLS-K) database, which contains the gender of the children and longitudinal observations of their reading, math and science ability scores at seven time points. These ability scores were obtained from an Item Response Theory (IRT) analysis and measure a child’s underlying ability. We study the influence of reading ability, math ability and gender on children’s science ability using the observations measured at selected grades. After deleting the subjects with missing data, we first standardize all the ability scores for convenience of analysis. The science ability score is the response, and gender, reading and math ability scores are the covariates, denoted by $Y_{ij}$, $X_{i1}$ (1 = male, 0 = female), $X_{ij2}$ and $X_{ij3}$, respectively. A correlation analysis shows that the components of $\mathbf{Y}$ are highly correlated, so we consider the following marginal model,

$$E(Y_{ij}\mid\mathbf{X}_{ij}=\mathbf{x}_{ij})=\beta_1x_{i1}+\beta_2x_{ij2}+\beta_3x_{ij3}.$$

To obtain the auxiliary information, we divide the subjects into subgroups, and we try three grouping schemes. First, we group the data by gender and by the math ability score at one grade. Merging the resulting subgroups in pairs, we obtain two subgroups that depend on only one of the two grouping variables. In the second case, we separate the subjects into groups by the reading and math ability scores at the first time point. The corresponding subgroups are

$$\Omega^{II}_1=\{(\mathbf{X}_2,\mathbf{X}_3):X_{21}\ge 0,\ X_{31}\ge 0\},\quad \Omega^{II}_2=\{(\mathbf{X}_2,\mathbf{X}_3):X_{21}<0,\ X_{31}\ge 0\},$$

Similarly, merging these subgroups in pairs gives two subgroups that depend on only one of the two scores. Finally, we use the reading ability scores at the first and third time points to form groups. The resulting subgroups take the form

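$$\begin{aligned}
\Omega^{III}_1&=\{\mathbf{X}_2:X_{21}\ge 0,\ X_{23}\ge 0\}, & \Omega^{III}_2&=\{\mathbf{X}_2:X_{21}\ge 0,\ X_{23}<0\},\\
\Omega^{III}_3&=\{\mathbf{X}_2:X_{21}<0,\ X_{23}\ge 0\}, & \Omega^{III}_4&=\{\mathbf{X}_2:X_{21}<0,\ X_{23}<0\}.
\end{aligned}$$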

Merging these subgroups in pairs again yields two subgroups. To apply the proposed method, we randomly sample a subset of the complete data for our analysis, and the remaining data are used to estimate the auxiliary information. We estimate the parameters of the marginal model by the QIF, GMMAI2 and GMMAI4 methods under the different grouping schemes. In each case, “GMMAI2” denotes the GMM estimator with information from the two merged subgroups and “GMMAI4” denotes the GMM estimator with information from the four subgroups. For example, “GMMAI2(I)” stands for the GMM estimator with auxiliary information provided by the two merged subgroups of the first grouping scheme. The analysis results are presented in Table 5.

The table shows that the parameter estimates obtained by the different methods are very similar. The estimates of $\beta_1$ are smaller than 0, which indicates that, among children with similar reading and math ability, girls have higher science ability than boys. This is in line with the common view that girls develop intellectually earlier than boys, so the result is reasonable. Moreover, the estimated coefficients of reading and math ability are larger than 0, which indicates that these two kinds of ability have positive effects on children’s science ability. Good reading ability helps children understand new things more easily, while strong math ability helps cultivate logical thinking, so both help to promote children’s science ability. Finally, the SEs of the estimated parameters obtained by the GMM method with subgroup information are smaller than those obtained by the QIF method, which illustrates that the auxiliary information can improve estimation efficiency. This is consistent with our theoretical results.

## 5 Discussion

In this paper, to improve the efficiency of the estimated coefficients in the marginal model, we proposed a GMM procedure with auxiliary information. The asymptotic properties of the proposed estimators have been established. The simulation studies and the real data analysis show that the proposed estimators are more efficient than the one obtained by the conventional QIF method. However, we only considered the use of auxiliary information in the marginal model based on a complete data set. It would be of interest to explore how to apply auxiliary information in more complicated settings, such as missing data, in future work. In addition, we only considered the case where the auxiliary information is consistent with the sample data set under study. It would be meaningful to study how to use auxiliary information that is inconsistent with the data set of interest.

## Appendix

For a vector or matrix, $\|\cdot\|$ denotes its norm. We impose the following regularity conditions, which are needed to establish the asymptotic properties of the estimators. Throughout, “$\stackrel{P}{\longrightarrow}$” denotes convergence in probability.

1. There exists a unique in a compact space, which satisfies .

2. and are positive definite and finite.

3. Matrix-valued function is second continuously differentiable with respect to and is uniformly bounded up to the second order partial derivatives, where is a diagonal matrix with each entry as the marginal conditional variance of the response, .

4. The matrix is second continuously differentiable with respect to , and there exist a matrix which is continuous and positive definite at such that

$$\sup_{\beta\in V_{\varepsilon,\beta_0}}\|\Sigma_n(\beta)-\Sigma(\beta)\|=o_p(1),$$

where $V_{\varepsilon,\beta_0}$ is a neighborhood of $\beta_0$.

5. The vector valued function is second continuously differentiable with respect to , and there exist a matrix which is continuous at such that

$$\sup_{\beta\in V_{\varepsilon,\beta_0}}\|\mathbf{g}^{(n)}(\beta)-\mathbf{g}(\beta)\|=o_p(1).$$

We define

 \bf Gn(β) =∂\bf g(n)(β)∂βT=⎛⎜ ⎜ ⎜ ⎜ ⎜⎝∂\bf S(n)(β)∂βT∂Ψ(n)(β)∂βT⎞⎟ ⎟ ⎟ ⎟ ⎟⎠, \bf W(l)(μi,β) Φn(β) =diag(\bf g(n)(β)T,⋯,\bf g(n)(β)T), ˜Ωn(β) =diag(∂Σ−1n(β)