Generalized linear mixed models (GLMM) are used today rather often for analysis of non-normal data with random effects. These data are commonly encountered in medical research and across many disciplines. They are particularly convenient for the analysis of categorical, e.g. binomial, Poisson, negative binomial etc. data. An excellent introduction to GLMM’s can be found in .
The research described in this manuscript has been motivated by modeling of clinical trials, especially diabetes clinical trials. In that area, it has been rather common historically to focus on the estimation of the treatment effect. While this is often important for sponsors, regulatory agencies, and the medical community at large, the estimation of treatment group means is also important for healthcare professionals to understand the absolute benefit versus risk. Thus, it is important to identify the appropriate methods to provide good estimators for the group means. Most types of statistical software, such as SAS or R, provide least squares group means. These means are estimated at the mean value of the baseline covariates; of course, for basic linear models, these means are equivalent to the average of individual response estimates for the population studied. Moreover, in this case, the least squares group means are consistent estimators of the true group means. For generalized linear models (GLM), the average of individual response estimates for the given population is not the same as the group mean estimated at the mean value of baseline covariates; moreover, as  pointed out, the latter is not even a consistent estimator of the true group mean.  were probably the first to address the issue of estimating the average of individual response estimates for GLM - type models. They described a possible way of estimating the mean outcome assuming assignment to a particular treatment for each patient in the trial, using each patient’s observed baseline covariates, and averaging these predictions over all patients in the trial. Additional research in this direction has been conducted by . They introduced a new semiparametric estimator of the group mean that is consistent even if the working model is misspecified. Both simple and stratified randomization schemes have also been discussed in detail.
The model used by  and  has a shortcoming in that it does not allow for a subject-wise random effect. As an example, such an effect may be needed to model dependence of observations for the same subject over time when time is present as a covariate. Doing so in this context changes the model considered in  and  from GLM to a GLMM. The estimation of treatment group means for this new model is as important as before and, to the best of our knowledge, has hardly been addressed in the literature so far.
In this paper we investigate a generalized version of the “marginal mean” estimator (using the terminology of ) of  that is adapted to the presence of a random effect in the model in the framework of GLMM. Our approach is frequentist in nature; a Bayesian approach has been attempted in . We also construct several possible confidence intervals for our estimator. The rest of this paper is structured as follows. In Section (2), we introduce our model in a more detailed way and construct a series of confidence intervals for the estimated marginal group mean. Section (3) introduces an alternative approach to inference about group means by constructing a prediction interval. Section (4) describes several simulation studies that illustrate our results. Finally, Section (6) shows excellent behavior of the proposed approach when applied to the real data.
2 Confidence intervals
We assume that there are observations in total. An th group is defined as an “intersection ” of the th treatment, , and th level of another covariate, . The other covariate may be either time or a blocking factor, such as gender. For example, if measurements are taken at time points and , while and correspond to control and treatment groups, respectively, we have four groups corresponding to pairs , , , and . Let the total number of groups be and the number of observations in each group Then, of course, we have
. Moreover, we also assume that the results are also influenced by a baseline variable. An example of a baseline variable may be some important prognostic factor, e.g. whether to use sulfonylurea as a background therapy during the trial of an antidiabetes medication. Thus, the covariate vector consists of the treatment assignment, the second covariate (either time or another covariate, e.g. gender), and the baseline variable. The subject - specific covariate vector will thus be denotedwhere is a group number, and is a subject index that enumerates all of the subjects within the th group. All observations of subjects in the th group are collected together in a vector Finally, there is also a random effect that is subject - specific and so we denote it
where the varianceis unknown. The corresponding true mean value will be denoted . It is related to the linear predictor through the use of inverse link function . Thus, conditionally on , we have the model . For each subject, the corresponding mean is, then, .
Our goal is to estimate the group mean and construct a confidence interval for it. In order to do so, we have to be able to estimate parameters of the GLMM, namely and . In general, when estimating these parameters, it is useful to keep in mind that the logistic, negative binomial, Poisson, and many other generalized linear mixed models, can be derived from a two-parameter canonical exponential family with the density
where and are parameters. is typically called the linear predictor and the scale parameter. Such a model implies that the mean is equal to , and the variance to . In most cases, the variance is a simple function of the mean. For example, for binary data we have the variance equal to , the scale parameter and the distribution in (2.1) simplifies to . In order to avoid confusion when integrating out the random effect, we denote a generic subject-wise random effect The marginal likelihood (where the random effect is integrated out) based on all observations is, then,
is the th conditional log-likelihood.
We obtain estimated fixed effect parameters and an estimated variance by maximizing the marginal likelihood . It is rather difficult, in general, to evaluate the marginal likelihood for a GLMM. Two main approximation methods that are used in practice are either a pseudo-likelihood approach (see e.g. ) or an integral approximations that uses the Gauss-Hermit quadrature or Laplace approximation (see e.g. ). Pseudo-likelihood does not work well for a logistic mixed model when the number of observations for the majority of subjects(locations) is small; moreover, it is also known to perform badly in situations where the distribution of the data comes from a family with more than one parameter (e.g. negative binomial). For a more detailed discussion of this problem see e.g.  p. 142. Since both logistic and negative binomial models are of great importance in clinical trial applications, we will use the integral approximation approach to computation of the marginal likelihood,
Finally, when estimators of the model parameters are available, a natural estimator of the group mean can be defined as
where is the estimate of . In the future, we will also call this estimator an unconditional group mean
. Now, we are going to consider two separate cases: the case of a logistic regression with a random intercept and the case of the negative binomial regression with a random intercept.
2.1 Logistic case
As a first scenario, we assume that the data is binary and that the data is a logistic regression with a Gaussian random effect. In order to construct a confidence interval for the estimated group mean , we need to estimate the variance of the estimated mean . Since the link function will be the logistic one, we have . The expectation of the above can be written down as
and is known as the logistic-normal integral. Since it cannot be obtained in closed form, some approximation is necessary. The most common approximation is probably the one obtained in :
where the constant in (2.5) is approximately equal to . With the above so-called Zeger’s approximation in mind, we suggest estimating using a plug-in estimator where .
Due to (2.3), the variance of the estimated group mean response for th group is
This suggests that it is necessary to estimate variance of individual means and the corresponding covariances in order to be able to construct a confidence interval for . To begin with, we will find the variance of using the multivariate delta method. Note that using Zeger’s approximation allows us to compute the approximate gradient of explicitly. Using the notation , and corresponding , we find that the gradient of with respect to and is , where
Thus, the variance of is approximately equal to
where is the covariance matrix of the estimated parameter vector . We will use the empirical Fisher scoring algorithm to compute maximum likelihood estimates of and ; see , pp. for details. In brief, this algorithm uses the sum of squared score functions , to estimate the information matrix at each iteration; that is, . Therefore, the inverse Hessian matrix , evaluated at the final iteration, provides a consistent estimate of the covariance matrix .
As a second step, we now estimate the covariance . This can be done using the multivariate delta method again. More specifically, we approximate the covariance within th group as
Now that we can estimate the variance of the group mean , we can also construct a confidence interval for it. Recall that where each . We suggest using an approximate large sample confidence interval of the form
that should be used for sufficiently large group size
. In the future, we will refer to this confidence interval as the direct interval. For comparison purposes, we will also consider another confidence interval that is based on the use of delta-method to compute the variance of the logit ofMore specifically, it is not hard to see that the
next, we construct a central limit theorem based interval for the and apply the inverse logit transformation to its end points. The final expression for such an interval is where
The details can be found in the Appendix of . In what follows, we will refer to this interval as the inverse confidence interval.
2.2 Negative binomial case
In this section we assume that the response data
have a negative binomial distribution with the mean parameterand the so-called size . This implies that the mean of the distribution is and the variance is and the constant index . For more details see e.g.  p.198-199. In the negative binomial case, the canonical link function is the natural log; thus, the inverse link function is the exponent and so . Unlike in the logistic case, here we obtain a closed form expression for . To keep notation clear, we denote the estimated mean . In order to obtain a confidence interval for , we need to provide a suitable approximation of the variance of
To do this, we will utilize the following result from .
Let where is a multivariate normal distribution with the mean vector
is a multivariate normal distribution with the mean vectorand the covariance matrix where . Then, the variance .
The Lemma (2.1) will enable us to compute an approximate variance of . For convenience, let us denote . The asymptotic distribution of can be viewed as approximately under the so-called “small-dispersion” assumption as defined in e.g. . Roughly speaking, “small-dispersion” asymptotics implies that the total sample size and the smallest group size as well. The
distribution involved can be viewed as approximately normal for a large number of degrees of freedom that will be the case in a typical study with a large number of subjects. Moreover,will have an asymptotically normal distribution; for an example of such an asymptotic result see e.g.  p. Therefore, each can be viewed as approximately normally distributed. Moreover, the same asymptotic result suggests that the estimator of is asymptotically unbiased and so the mean of the normal distribution of is, approximately, . Using the argument above, we can view each estimator as approximately lognormally distributed. To construct a confidence interval, we will also need to estimate each element of the covariance matrix with This can be done by using the delta method. We denote the vector of parameters and the corresponding vector of estimators Then, the gradient and e.g. the approximate variance while the approximate covariance of any two within th group is The variance of can then be approximated, using the above mentioned result from , as
Now we can obtain, similarly to (2.8), a straightforward large sample confidence interval for the true based on the use of CLT applied to the sum in (2.11). Using the approximate variance of from (2.12), such a large sample confidence interval will take the form
that is approximately correct for sufficiently large group size . In the future, we will also refer to (2.13) as a direct interval. For comparison purposes, we will consider two other confidence intervals as well. The first of these is analogous to the inverse confidence interval introduced earlier for the logistic data case except that the link function is now a log function (and the inverse link is the exponential function). We will refer to this interval as the inverse interval, as before. Its final form is
Finally, the third confidence interval that we propose is based on the idea of approximating a sum of lognormally distributed random variables. For simplicity, we will call this one a lognormal confidence interval. First, recall that . Next, we recall that each estimated and that individual estimated parameters and are approximately normally distributed as remarked earlier. Thus, we can assume that has an approximate lognormal distribution. If we want to be more precise than just using a large sample confidence interval based on CLT, we have to obtain some approximation for the distribution of a sum of a large number of lognormal random variables.
A question of approximating a sum of lognormal random variables acquired substantial importance first because of the need to compute an outage probability in cellular radio systems. This sum does not have a closed form distribution; hence, an approximation is required. Until recently, Schwartz-Yeh method , which is iterative in nature, has been the most used approximation method. An alternative is an extended version of Fenton and Wilkinson methods; see, for example,  and . Both of these methods are based on the idea of approximating the sum of lognormal distributions by another lognormal distribution. We will use just this basic idea in our approach. More precisely, recall that Earlier, we noted that each is approximately lognormally distributed. It is easy to show that if is assumed to be lognormally distributed, then for any positive , is also lognormally distributed with the mean equal to and the variance .
At this point, let us denote and and percentiles of the lognormal distribution with the mean and the variance Then, the lower and upper bounds of the confidence interval for are and
3 Prediction interval for the group mean
As an alternative to constructing a confidence interval for the group mean, one can also consider a prediction interval for the group mean. Let us restate some of the main features of our model. First, remember that is the th response of th subject, and is the subject-wise random effect. Random effects are assumed independent between subjects; at the same time, it is convenient to define an overall vector of random effects where In this setting, of course, the total number of observations is The responses come from a generalized linear model (conditional on the random effects) with linear predictors
here, is a vector of fixed effects, and is a vector of covariates whose values may be different at different response points. Throughout this section, our notation will be somewhat different from the notation used in the Section (2) in order to make the exposition of ideas more straightforward. We denote the conditional mean of observations This mean is connected to the linear predictor through the canonical link function as Note that we use the notation instead of the standard in order to keep it distinct from the estimated marginal mean of the Section (2). Remember also that observations belong to one of groups where each group consists of observations, Of course, the total number of observations can also be expressed as We will also use to denote a set of indices describing observations in the th group. Now, let us define a vector of covariate values for a generic th subject as of dimensionality with the th element being equal to and all others to zero. Let also be a generic vector of fixed covariate values that may or may not be equal to With our model in mind, we will frame our problem as that of inference about linear combinations of the form
Our eventual purpose will be to predict the group mean of the form
and to construct a prediction interval for such a predicted group mean.
In our work, we will use the following system of notation. First, recall that the responses are (conditionally) independent and their conditional density function is of the form
in the above, are known weights, is the dispersion parameter, and canonical parameters are related because (see e.g. ). Let the vector of observations for th subject be Let also be the corresponding covariate matrix of fixed effects and be the vector of ones. Stacking row vectors , on top of each other while repeating each one times produces a complete random covariate matrix of dimensionality Let be the variance function for the generalized linear model defined in (3.1) and (3.4). The diagonal matrix of iterative weights for the th subject is denoted The complete data vector, covariate, and weight matrices are and respectively. It is assumed that the matrix has a full column rank. In addition, we let be the diagonal matrix with on its diagonal. Finally, we also denote the complete parameter vector of the model (3.1)-(3.4) and the subvector of variance components
Any inference about random effects will be based on the conditional likelihood
where is the normalizing expression. An alternative, and often convenient form for the conditional likelihood in (3.5) is where, for convenience, the dependence on and has been suppressed. As is common in mixed model theory, the natural predictor of the random effect is its conditional mean so that the predicted th random effect is defined as In other words, only observations on th subject, contained in the vector , are relevant for prediction of . To make the notation easier, we will omit the parameter vector in the subscript in the future, unless absolutely necessary. Similar approach is usually adopted in most of GLMM literature; see, e.g. .
As a first step, we will obtain prediction variance for First, a natural point predictor for the vector is its conditional mean, Clearly, the point predictor for is Then, our task is to construct a prediction interval for the predicted group mean In practice, of course, the parameter vector is not known and so we will be constructing a prediction interval for a somewhat different predicted group mean
where is a consistent estimator of The estimator (3.6) may also be called a conditional group mean. If the parameter vector is known, a sensible prediction variance for is
Of course, in practice is not known and has to be estimated. Let denote a consistent estimator of Simply substituting into (3.7) does not work since this approach ignores sampling variability associated with Following the idea of , we define the prediction variance of ) as the conditional squared error :
Since and are independent, one can easily show that
where is a nonnegative correction term that accounts for estimation of while the term is the so-called “naive” prediction variance of that would have been used if was known. Let denote the maximizer of that satisfies the equation Also, let denote the matrix of iterative weights for the th subject evaluated at and the complete matrix of iterative weights for all subjects evaluated at Using Laplace approximation as in, for example, , chap. (see also ) one can obtain the following approximation to the “naive” prediction variance of
Now we need to provide a convenient approximation to the second term that accounts for the sampling variability due to parameter estimates. As a first step, we note that
Since we can write the matrix as a combination of two block matrices
The above implies that