. Yet, it was unknown until recently whether MLEs are consistent even in some simple generalized linear mixed models (GLMMs)
. What complicates proving consistency in some mixed models is the dependence among response variables induced by certain random effect designs. Of course, not all types of dependence between responses are problematic – there is a vast literature on maximum likelihood estimation with dependent observations[1, 7, 14, 15, 28, 30, 33]. But, as we will discuss in more detail below, for some commonly used random effect designs such as those with crossed random effects, existing conditions for consistency of MLEs are hard to verify 
. In a few GLMMs with crossed random effects, consistency has been proved using a novel argument that relates the likelihood for the full data to that of a subset consisting of independent and identically distributed (i.i.d.) random variables, “the subset argument”.
Fundamentally, however, the issue is not unique to GLMMs or even mixed models; any other parametric model appropriate for the same settings may present similar difficulties. Accordingly, it was recognized in the first work on consistency using subsets that the idea has the potential to be extended to more general models. We address this by deriving weaker conditions, based in part on the use of subsets, that are sufficient for consistency of MLEs, without assuming a particular model. They help explain formally what makes the subset argument work, why it is useful in some settings where more classical ones are not, and when it can fail. We illustrate the usefulness of our conditions by proving consistency of MLEs in two multivariate GLMMs (MGLMMs) to which existing theory has not been applied successfully.
To fix ideas, let denote a parameter set,
a joint density for the random vector, and the “true” parameter. Let also and . If is a finite set, then since , a necessary and sufficient condition for consistency of MLEs is that, as ,
When is not a finite set, (1) needs to be amended by a uniformity argument to be sufficient, but the main ideas are the same. There are many ways to establish (1). With i.i.d. observations and regularity conditions, (1
) or stronger results follow from the law of large numbers applied to[6, 9, 10, 31]. More generally, if is a stochastic process, may, suitably scaled, satisfy an ergodic theorem, leading again to (1) under regularity conditions. In the literature on maximum likelihood estimation with dependent observations, it is often assumed that some such limit law holds, either for or its derivatives [7, 14, 15]
, or that the moments ofconverge in an appropriate way [1, 28]. Unfortunately, in many practically relevant settings, it is not clear that any such convergence holds and proving that it does is arguably the main obstacle to establishing consistency of MLEs. Let us illustrate using an MGLMM, commonly considered both in statistics and applied sciences [4, 5, 12, 21, 32].
Let be a matrix of non-stochastic predictors, a non-stochastic design matrix, and a multivariate normal vector of random effects, with mean zero and covariance matrix . For the MGLMM, , for some , , and . The responses are conditionally independent given , with conditional densities in the form
where, for , is the conditional cumulant function, a dispersion parameter, and ensures integrates to one. Conditional independence implies . Several of the responses could be from the same subject, hence the “multivariate”, and they can be of mixed type, some continuous and some discrete, for example.
The dependence among the linear predictors is easily characterized since . The relevant density for maximum likelihood estimation, however, is the marginal density,
where denotes the -dimensional multivariate normal density with mean zero and covariance matrix . The density typically does not admit a closed form expression. Moreover, the dependence among responses it implies is in general less transparent than that among the linear predictors. What we can say in general is that two responses are dependent only if their corresponding linear predictors are. That is, response component and are independent if .
It is convenient if is, upon possible reordering of the responses, block diagonal since in that case the full vector of responses can be partitioned into independent sub-vectors. If these are of fixed length as grows then one is back in the classical setting where the full data consists only of an increasing number of independent vectors. This setting is common to many articles on asymptotic theory in mixed models [13, 24, 25, 29]. Unfortunately, in applications the number of independent response vectors – the number of diagonal blocks in – is often small. For example, Sung and Geyer  note that in the famous salamander data  there are 3 independent vectors, each of length 120. Thus, in their notation there are independent observations, but in our notation there are possibly dependent observations. It seems more reasonable, then, to assume that without also assuming that the response vector consists only of an increasing number of independent sub-vectors. This type of limiting process has, in the context of mixed models, previously only been investigated carefully in special cases that do not allow for predictors or mixed-type responses [17, 23]. To be sure, Jiang’s  general theory does allow for predictors, but the specific applications do not.
The intuition behind the usefulness of the subset argument can be understood by considering the following simple LMM with crossed random effects. Suppose , where , , and are all i.i.d. standard normal, , . It is easy to check that the s cannot be partitioned into independent subsets. However, there are many subsets that, even though there is dependence among them, consist of independent random variables. For example, the two subsets and are dependent, but taken separately they both consist of i.i.d. random variables. The MLE of based on either subset, i.e. a subset sample mean, is consistent as . Intuitively, then, the MLE based on all of the variables should be too. Of course, the subset argument is not needed to prove that in this simple example, but the intuition is the same for models where a direct proof is harder. How to formalize this intuition, without actually having to require the subset components to be either independent or identically distributed, is the topic of Section 2.
The rest of the paper is organized as follows. We develop our theory for consistency of MLEs using subsets in Section 2. These results do not assume any particular model. In Section 3 we apply the theory from Section 2 to two MGLMMs. Section 4 contains a brief discussion of our results and their implications. Many technical details are deferred to the appendices.
2 Consistency using subsets of the full data
Recall that denotes a collection of random variables and let denote a collection of random variables that form a subset of those in , i.e. . We will henceforth call a subcollection of to avoid confusion with other subsets introduced later. The main results in this section give conditions for when subcollections can be used to prove consistency of maximizers of . Unless otherwise noted, all convergence statements are as tends to infinity and the number of elements in a subcollection, , tends to infinity as a function of .
All discussed random variables are defined on an underlying probability space, with the elements of denoted . The parameter set is assumed to be a subset of a metric space . We write, for any and , . For any , denotes its closure and its boundary. We assume the true parameter is the same for all but the joint density of , against a dominating, -finite product measure , can depend on in an arbitrary manner. In particular, our setting allows for a triangular array of responses, , though for convenience we do not make this explicit in the notation.
By being the true parameter we mean that for any measurable in the range space of . That is, expectations and probabilities with respect to are the same as those taken with respect to distributions indexed by . Densities for the subcollection and its components are denoted by in place of ; for example, .
We will use subcollections to establish the following sufficient condition for consistency of maximizers of :
The appeal of using subcollections to prove (2), instead of directly working with the full data likelihood , can be explained using the following lemma.
For every , , and subcollection , -almost surely,
Versions of Lemma 2.1 are well known [17, 18], but Appendix A contains a proof for completeness. From the lemma it follows that if , then by dominated convergence. That is, up to a uniformity argument, can be established by showing that the likelihood of the subcollection converges to zero in probability, outside of a neighborhood of . Uniform versions of that convergence will play a crucial role in our results.
We say that a subset is identified by a subcollection if . If for some sequence of constants , , we call an identification rate.
To understand this definition better, consider the case where the subcollection consists of i.i.d. random variables with common marginal density . Suppose also that there is no for which -almost everywhere. That is, is an identified parameter in the classical sense if we restrict attention to the parameter set . Then, under regularity conditions [10, Theorems 16 and 17], one has and, by a uniform strong law of large numbers,
Using this, it is straightforward to show that is identified by with an identification rate that is exponentially fast in . That is, with i.i.d. components and regularity conditions, the classical definition of an identified parameter implies identification in the sense of Definition 2.1. However, we want to allow for subcollections that do not consist of i.i.d. components, and in that case the classical definition is not as useful. For example, we have independent but not identically distributed components in one of our MGLMMs. In this and more general cases, a parameter could be identified in the classical sense for all sample sizes , but, loosely speaking, the difference between the distributions for indexed by some and that indexed by could vanish asymptotically, preventing from identifying in our sense. Finally, notice also that being identified by is essentially equivalent to MLEs based on with the restricted parameter set being consistent.
We can now be more precise about how to use subcollections to establish (2). The strategy is to first find a subcollection that identifies for every , and then use Lemma 2.1 to get the convergence for the full likelihood in (2). For this strategy to be useful, showing that identifies has to be easier than showing that does since the latter would directly imply (2). That is, one has to be able to pick out a subcollection with more convenient properties than the full data. Our applications in Section 3 illustrate how this can be done.
It is useful to allow for several subcollections , consisting of components, and subsets , . By doing so, different subcollections can be used to identify different subsets of the parameter set. For example, if the parameter set is a product space, as is common in applications, then different subcollections can be used to, loosely speaking, identify different elements of the parameter vector. Assumption 1 makes precise what we need to identify using several subcollections.
For every small enough , there are subsets and corresponding subcollections , , such that and each is identified by with some identification rate , , .
This assumption is somewhat similar to assumptions A2 and A3 made by Jiang , which are also assumptions about parameter identification using several subcollections. However, those assumptions are stated in terms of and
. The fact that we do not have to assume anything about the variances of the log-likelihood ratios is an important improvement. For example, if subcollectionconsists of i.i.d. components, the convergence of is immediate from the law of large numbers, but calculating its variance may be difficult.
For finite parameter sets, Assumption (1) is enough to give consistency of MLEs via Lemma 2.1. For more general cases we also need to control the regularity of the log-likelihood for the full data. The following two assumptions are made to ensure that the uniformity of the convergence detailed in Assumption 1 and Definition 2.1 carries over to , in the sense of (2).
For every and , is -almost surely Lipschitz continuous in on the defined in Assumption 1; that is, there exists a random variable not depending on such that, -almost surely and for every ,
The assumptions give us the convergence in (2) and, consequently, the following lemma.
We give an outline here and a detailed proof in Appendix A. Without loss of generality, we may assume , so there is one subcollection that identifies , for arbitrary, small , with rate . It suffices to prove that . For let be a point in the intersection of and the th ball in the cover of given by Assumption 3. Some algebra and Assumption 2 gives
The right hand side is by Assumption 3, so the expectation of the left hand side is also by dominated convergence, which finishes the proof. ∎
We will use Lemma 2.2 to establish both a Wald-type consistency, meaning consistency of sequences of global maximizers of , and a Cramér-type consistency, meaning consistency of a sequence of roots to the likelihood equations . It follows almost immediately from the lemma that if has a global maximizer , -almost surely for every , then . In particular, if is compact one gets Wald-type consistency with an additional continuity assumption. Since Assumption 2 implies is continuous at every point except possibly , assuming continuity also at the unknown should be insignificant in any application of interest.
Since continuous functions attain their suprema on compact sets, has a maximizer on , -almost surely. By Lemma 2.2 all maximizers are in with probability tending to one, for all small enough . ∎
Though compactness is a common assumption [15, 34], it is sometimes too restrictive or even unnecessary. If , or more commonly , is strictly concave in on a convex , then it is enough to verify the assumptions on a neighborhood of (c.f. Theorem 2.4) to get consistency of the unique global maximizer. However, a global maximizer need not exist even as , or perhaps the assumptions cannot be verified for other reasons. With a few additional assumptions, Lemma 2.2 can then be used to get the weaker Cramér-type consistency, which also only requires verifying assumptions for neighborhoods of .
If for some , is almost surely differentiable in on a neighborhood of an interior for every , and Assumptions 1 – 3 hold with replaced by for all small enough , then, with probability tending to one as , there exists a local maximizer of , and hence a root to the likelihood equation , in , for all small enough .
Since is interior we may assume is small enough that all points of are interior. Almost sure differentiability of implies almost sure continuity. Thus, attains a local maximum on the compact , -almost surely. By Lemma 2.2, with probability tending to one, there are no such maximizers in . Thus, with probability tending to one, there exists a local maximizer in . Since and hence is -almost surely differentiable, any such maximizer must be a root to the likelihood equation . ∎
3 Application to multivariate mixed models
3.1 Longitudinal linear mixed model
The first model we consider is an extension of the variance components model that has been studied previously . In addition to dependence between subjects induced by crossed random effects the model incorporates autoregressive temporal dependence between measurements from the same subject. To make the discussion clearer we assume easy-to-specify fixed and random effect structures. This allows us to focus on the core issues, that is, on how to select subcollections and subsets that can be used to verify the conditions of our theory. Our model includes a baseline mean and a treatment effect. A general fixed effect design matrix could be treated the same way as in our second example, discussed in Section 3.2. Before establishing consistency, we discuss the model definition and how to select appropriate subcollections.
Suppose for subjects , , , and time points we observe the response , where for convenience we assume both and are even. Let the stacked vector of responses be
Recall from the introduction that the MGLMM is specified by the conditional distribution and the distribution of the random effects, . For a linear mixed model we let
be a multivariate normal distribution with meanand covariance matrix , , where the two components of are a baseline mean and a treatment effect, respectively, and denotes the identity matrix. Note, in the notation of the introduction, the dispersion parameter in the conditional distribution is , for all . We treat as a parameter to be estimated and not as known, which is otherwise common in the literature.
Let be a vector of zeros and ones where the th element is one if it corresponds to an observation in time and zero otherwise and let denote an -vector of ones. We take , which corresponds to a treatment being applied in the first half of the experiment. Notice that unless is fixed, which we do not assume, this setup implies the predictors change with . Indeed, as grows, a particular observation can go from being made in the latter half of the experiment to the earlier half. Thus, the responses form a triangular array.
Partition into three independent sub-vectors, , , and , where is a first order autoregressive correlation matrix, , , and . We will use and as crossed random effects, inducing dependence between subjects, and to get temporal dependence within subjects. To that end, let , , and . Then, with , the covariance matrix of the linear predictors is
More transparently, for the elements of , it holds that
The marginal density admits a closed form expression in this example. Specifically, the marginal distribution for is multivariate normal with mean and covariance matrix . Note that the structure of is similar to that of the covariance matrix of the linear predictors just discussed. In particular, there are many zeros in the covariance matrix , i.e. there are many independent observations, but cannot be partitioned into independent vectors.
3.1.1 Subcollection selection
The model definitions imply that , a subset of , where denotes the Euclidean norm when applied to vectors. We write .
Subcollections are selected for the purpose of verifying Assumption 1. The main idea guiding selection is suggested by the fact that identification follows, under regularity conditions, if the subcollection’s log-likelihood satisfies a law of large numbers. We will use such subcollections and require that they together identify in the classical sense. By this we mean that, letting denote the distribution of subcollection implied by parameter ,
With these properties in mind, we take to consist of the vectors . Because these vectors do not share any random effects, they are independent. In fact, they are i.i.d. multivariate normal with common mean and common covariance matrix . Clearly, and are identified in the classical sense by this subcollection, but not . Note that even though the predictors, and hence the distributions, do not change with for this subcollection, it is strictly speaking a triangular array unless is fixed.
To identify the remaining parameters, take to consist of the vectors
These are also i.i.d. multivariate normal, with common mean and common covariance matrix
It is straightforward to check that implies , .
In summary, the two subcollections together identify in the classical sense. Moreover, since both subcollections consist of i.i.d. multivariate normal vectors, their log-likelihoods satisfy a law of large numbers as . With this we are equipped to verify that Assumptions 1 – 3 hold locally, leading to the main result of the section in Theorem 3.4.
The purpose of this section is to verify the conditions of Theorem 2.4. The interesting part of that is to check that Assumptions 1 – 3 hold with replaced by , for all small enough . For this purpose we will first prove two lemmas that roughly correspond to Assumptions 1 and 2. The limiting process we consider is that tends to infinity while can be fixed or tend to infinity with , at rates discussed below. Thus, the statements and are equivalent. We will need the following result which is proved in Appendix B.
If is compact, is continuous in on for every in the support of , , and , then for any there are compact sets such that , , and .
Note, when applying the proposition in the present application, , , and is replaced by . The proposition is useful because the s it gives are compact, as we will see in the proof of the following lemma. This lemma formalizes verification of Assumption 1.
If is an interior point of , then for all small enough there exist subsets and such that ,
-almost surely, , and, consequently;
is identified by with an identification rate for some , .
We give an outline here and a detailed proof in Appendix B. It is easy to check that the requirements of Proposition 3.1 are satisfied with replaced by . By taking the s to be the s given by Proposition 3.1, proving points 1 – 2 is similar to proving that MLEs based on subcollection are consistent if the parameter set is restricted to the compact set , . Since the subcollection components are i.i.d., this is straightforward using classical ideas [10, Theorems 16 and 17]. The only difference from the referenced work is that one subcollection is a triangular array and so we use a different strong law. Point 3 follows from points 1 and 2. ∎
Note that, in this lemma and elsewhere, is a small number that is defined in context whereas always denotes the radius of the neighborhood of we are considering. It remains to verify the assumptions concerned with the regularity of the log-likelihood of the full data. When the log-likelihood is differentiable, Lipschitz continuity follows from the mean value theorem if the gradient is bounded. The following lemma uses that to verify Assumption 2. The resulting Lipschitz constant, i.e. the bound of the gradient, is the same for both and . The lemma also gives a probabilistic bound on the order of this Lipschitz constant as that will be useful when verifying Assumption 3.
If is an interior point of , then for every and small enough there exists a random variable such that, -almost surely,
for some .
) is largely an exercise in bounding the eigenvalues of the covariance matrixand its inverse on interior points of . We are ready for the main result of the section.
If is an interior point of and for some as , then, -almost surely, there exists a sequence of roots to the likelihood equations such that .
We verify the conditions of Theorem 2.4. Fix an arbitrary . Since is interior we may assume is small enough that all points in are interior points of . By Lemma B.3 is -almost surely differentiable on , so is, too. By Lemma 3.2, Assumption 1 holds with what is there denoted replaced by . The identification rate is exponentially fast in , for some . Lemma 3.3 shows that is -Lipschitz on both and , and that for some . This verifies Assumption 2. It remains only to verify that the rate conditions in Assumption 3 hold. The -covering number of the sphere is as [3, Lemma 1]. Thus, since , by picking we can have as , . Our choice of ensures , which is the first rate condition. Since the identification rate is exponential in for both subcollections, we have that for some , which is as since is of (at most) polynomial order in . ∎
Two key components in the proof of Theorem 3.4 are that the subcollections consist of i.i.d. random vectors that identify the parameters and that the gradient of the log-likelihood is of polynomial order in . We expect the same proof technique to work in many other cases. Essentially, all that is needed is that the subcollections grow faster than logarithmically in the total sample size and that the gradient is of polynomial order, no matter how large that order is.
It is possible that the assumption that , , could be relaxed by picking other subcollections that also make use of the variation in the time dimension. It is not trivial, however, since the dependence between any two responses sharing a random effect does not vanish as time between the observations increases. In the next section we examine how predictors and mixed-type responses affect the argument.
3.2 Logit-normal MGLMM
The model we consider in this section is an extension in several ways of the logistic GLMMs for which the technique based on subcollections was first developed . The random effect structures are similar, i.e. crossed, but we have multivariate, mixed-type responses, and predictors. The main ideas for verifying the assumptions of the theory from Section 2 are the same as in our LMM example. However, due to the inclusion of predictors, we use results from empirical process theory in place of the more classical strong laws used for the LMM. Showing existence of appropriate subsets of the parameter space that the subcollections identify also requires more work than with i.i.d. components. As before, we discuss the model definition and subcollection selection before establishing consistency.
Suppose for subjects , , there are two responses, which is continuous and which is binary. The vector of all responses is
For each subject we observe a vector of non-stochastic predictors , the same for both responses. Similarly, is the same for both responses. Let be the linear predictor, , , , where , . We assume that for all . In practice this only rules out the possibility that since our setting allows for the standardization of predictors. The conditional density of the responses given the random effects that we consider is
Given the random effects, is normal with mean and variance 1, and is Bernoulli with success probability – a logistic GLMM. The choice of for all is made for identifiability reasons for the Bernoulli responses, and for convenience for the normal responses. Setting the s to some other known constants does not fundamentally change the results.
Consider two independent vectors, and , and corresponding design matrices and . With and the linear predictors are . Thus, responses from the same subject share two random effects, responses from different subjects with one of the first two indexes in common share one random effect, and other responses share no random effects and are hence independent. The covariance matrix for the linear predictors is easily computed in the same way as in the LMM. The covariance matrix for responses, however, is less transparent. It is for simplicity that we assume in this section that all random effects have the same variance. It is not necessary for our theory to be operational but this simplification shortens proofs considerably and allows us to focus on the main ideas.
3.2.1 Subcollection selection
With predictors the -dimensional parameter set is , a subset of the metric space . The intuition behind the selection of subcollections is that the normal responses should identify the coefficient and the variance parameter . Similarly, the Bernoulli responses should identify the coefficient vector . With that in mind we take , . Both of these subcollections consist of independent but not identically distributed random variables – independence follows from the fact that no components in the same subcollection share random effects. Notice that these subcollections are in practice often triangular arrays since the predictors may need to be scaled by to satisfy
. All responses in the first subcollection have marginal normal distributions and all responses in the second have marginal Bernoulli distributions.
Identification is more complicated than in our previous example. One issue is that there can be many and that give the same marginal success probability for the components in the second subcollection. A second issue is that, since the predictors can change with , classical identification for a fixed does not necessarily lead to identification in the sense of Definition 2.1. Additionally, the approach used in the LMM to find appropriate subsets and by means of Proposition 3.1 only works in general when the subcollection components are i.i.d. Thus, we take a slightly different route to establishing consistency compared to the LMM.
Let denote the minimum eigenvalue of its matrix argument.
If is an interior point of and
then for all small enough there exist and such that ,
, and, consequently;
is identified by with an identification rate for some , .
A detailed proof is Appendix B, we here give the proof idea. Let , for some small . Let be the closure of . The idea is that if is small enough, so that and on , then the distributions of implied by and are different if