# Grouped Heterogeneous Mixture Modeling for Clustered Data

Clustered data with a grouping structure (e.g. postal area, school, individual, species) appear in a variety of scientific fields. The goal of statistical analysis of clustered data is to model the response as a function of covariates while accounting for heterogeneity among clusters. For this purpose, we consider estimating cluster-wise conditional distributions by mixtures of latent conditional distributions common to all the clusters, with cluster-wise different mixing proportions. For modeling the mixing proportions, we propose a structure in which the clusters are divided into a finite number of groups and the mixing proportions are assumed to be the same within each group. The proposed model is interpretable, and the maximum likelihood estimator is easy to compute via the generalized EM algorithm. In the setting where the cluster sizes grow with, but much more slowly than, the number of clusters, some asymptotic properties of the maximum likelihood estimator are presented. Furthermore, we propose an information criterion for selecting the two tuning parameters, the numbers of groups and of latent conditional distributions. Numerical studies demonstrate that the proposed model outperforms some existing methods.


## 1 Introduction

Clustered data with a grouping structure (e.g. postal area, school, individual, species) appear in a variety of scientific fields. The goal of statistical analysis of clustered data is to model the response as a function of covariates while accounting for heterogeneity among clusters. The standard approach to analyzing clustered data is to use mixed models or random effects models (e.g. Demidenko, 2013; Jiang, 2007). However, since this approach essentially aims at modeling conditional means in each cluster, it would not be plausible when the data distribution is skewed or multimodal. As an alternative modeling strategy, the finite mixture model (McLachlan and Peel, 2000) has been extensively applied for its flexibility in capturing the distributional relationship between response and covariates. For independent (non-clustered) data, the mixture model with covariates was originally proposed by Jacobs et al. (1991), known as mixture-of-experts, and a large body of literature has been concerned with flexible modeling of the conditional distribution of non-clustered data, e.g. Jordan and Jacobs (1994); Geweke and Keane (2007); Chung and Dunson (2009); Villani et al. (2009, 2012); Nguyen and McLachlan (2016). However, these methods cannot be directly imported into the context of analyzing clustered data. One possible direct adaptation of these mixture models to modeling cluster-wise conditional distributions is to apply them to the whole dataset without taking account of cluster information (which we call global mixture modeling); this produces estimated cluster-wise distributions that are all identical. Such a result would be inappropriate, and the relationship between response and covariates might be misunderstood. Another possible adaptation is fitting the mixture models separately to each cluster (which we call local mixture modeling), so that cluster-wise heterogeneity is addressed. However, since cluster sizes are often not large in practice, the estimation might be unstable.

For modeling cluster-wise conditional distributions, Rubin and Wu (1997) and Ng and McLachlan (2014) proposed mixtures of random effects models, but these would be computationally burdensome, since the estimation of even a single random effects model still presents a major challenge for non-normal responses (e.g. Hui et al., 2017b). Moreover, the result from a mixture of mixed models would be difficult to interpret. An alternative approach is latent mixture modeling (Sugasawa et al., 2017), which can be regarded as a compromise between the global and local mixture models, in which the cluster-wise conditional distributions are expressed as the following mixture:

$$ f_i(y\mid x)=\sum_{k=1}^{L}\pi_{ik}\,h_k(y\mid x;\phi_k),\qquad i=1,\dots,m, \tag{1} $$

where $m$ is the number of clusters, $h_k(y\mid x;\phi_k)$ is a latent conditional distribution common to all the clusters, which typically comes from a generalized linear model with parameter $\phi_k$, and $\pi_{ik}$ is the cluster-wise mixing proportion. In the above model, cluster-wise heterogeneity is captured by the mixing proportions $\pi_{ik}$, so that the model is easy to interpret. For modeling $\pi_{ik}$, Sun et al. (2007) proposed the use of a logistic mixed model, while Sugasawa et al. (2017) proposed the use of a Dirichlet distribution. In both models, due to the additional stochastic structure, parameter estimation requires numerically intensive methods such as Markov chain Monte Carlo or the Monte Carlo EM algorithm, as used in Sun et al. (2007) and Sugasawa et al. (2017), respectively. Apart from parametric modeling of cluster-wise conditional distributions, Rosen et al. (2000) and Tang and Qu (2016) proposed mixture modeling based on generalized estimating equations, but the primary interest of these works is estimation of the component distributions in the mixture by taking account of correlations within clusters. Moreover, among nonparametric Bayesian approaches, hierarchical Dirichlet processes (Teh et al., 2006) share a similar philosophy with latent mixture modeling in that both methods assume shared mixing components among clusters, and Rodriguez et al. (2008) is also concerned with modeling cluster-wise conditional distributions in terms of Bayesian nonparametrics. Hence, most modeling strategies for cluster-wise conditional distributions are either computationally intensive or hard to interpret. In other words, approaches to estimating cluster-wise conditional distributions while accounting for cluster-wise heterogeneity are still limited.

The purpose of this paper is to introduce an effective mixture modeling approach, called grouped heterogeneous mixture (GHM) modeling, that is easy to fit and interpret. We develop our method in the framework of the latent mixture model (1), and introduce a novel structure for the mixing proportions: the clusters can be divided into a finite number of groups, and clusters within the same group share the same mixing proportions. Specifically, let $g_i\in\{1,\dots,G\}$, $i=1,\dots,m$, be a group membership variable, where $G$ is the number of groups, and let $\pi_{gk}$ be the unknown mixing proportions in the $g$th group. Then the proposed structure is $\pi_{ik}=\pi_{g_i k}$. Since we do not know how the clusters are divided into groups, we treat the group membership variable $g_i$ as an unknown variable (parameter) and estimate it from the data. Owing to the simple structure of $\pi_{ik}$, the likelihood function of the GHM model can be easily obtained, and its maximization can be carried out by the generalized EM algorithm (Meng and Rubin, 1993), as described in Section 2. Note that the estimation of group membership is closely related to the k-means algorithm (Forgy, 1965), and the idea of grouping parameters has recently been considered in the context of panel data analysis (e.g. Hahn and Moon, 2010; Bonhomme and Manresa, 2015; Ando and Bai, 2016), but this paper is the first to use the idea for modeling cluster-wise mixing proportions.

Since the number of group membership variables grows with the number of clusters $m$, the cluster size should also grow with $m$ to consistently estimate them. Nevertheless, owing to the grouping structure, the model parameters, including the group memberships, are shown to be consistently estimated in a setting where the cluster size grows considerably more slowly than the number of clusters. Specifically, for $n$ the minimum cluster size among the $m$ clusters, our asymptotic framework is $n\to\infty$ and $\sqrt{m}\,n^{-\delta_1}\to 0$ for some $\delta_1>0$, which would be met in various applications.

The rest of the paper is organized as follows. In Section 2, we introduce GHM modeling, describe the generalized EM algorithm, and propose a BIC-type information criterion for selecting the number of latent distributions, $L$, and the number of groups, $G$. In Section 3, we present the asymptotic properties of the estimator of the model parameters. In Section 4, we carry out numerical experiments to demonstrate the performance of GHM modeling relative to global and local mixture models and random effects models, based on simulated data as well as real-world data.

## 2 Grouped Heterogeneous Mixture Modeling

### 2.1 The model

Suppose that we have clustered observations $y_{ij}$, $i=1,\dots,m$, $j=1,\dots,n_i$, with an associated $p$-dimensional vector of covariates $x_{ij}$. Let $f_i(y\mid x)$ be a density or probability mass function of $y_{ij}$ given $x_{ij}$, assumed to be the same within a cluster but different across clusters. Our primary interest is estimating the cluster-wise conditional density from the data. To this end, we propose the following grouped heterogeneous mixture (GHM) model:

$$ f_i(y\mid x)=\sum_{k=1}^{L}\pi_{g_i k}\,h_k(y\mid x;\phi_k),\qquad i=1,\dots,m, \tag{2} $$

where the $h_k$'s are latent conditional densities common to all the clusters, $g_i\in\{1,\dots,G\}$ denotes the group membership variable, and $\phi_k$ is a vector of unknown parameters in the $k$th latent density. As mentioned in Section 1, the key notion in the GHM model (2) is to assume that the clusters can be divided into $G$ groups (sub-clusters) and that the same mixing proportions (hence the same conditional density) hold within the same group. The latent model $h_k$ should be specified by the user depending on the type of the response variable, and generalized linear models would be a typical choice. For instance, we may use a normal density with a linear mean function and unknown variance when the response takes continuous values, or a Bernoulli distribution with a logistic success probability when the response is binary.

We treat the group membership variables as unknown parameters and estimate them from the data, which means that the clusters are adaptively grouped by reflecting information in the data. Moreover, in the GHM model (2), the heterogeneity among clusters is expressed by the mixing proportions $\pi_{g_i k}$, so that the GHM model is flexible yet parsimonious, and the estimated results are interpretable. There are three types of unknown parameters in (2): the mixing proportions $\pi_{gk}$ in each group, the set of group membership variables $\gamma=(g_1,\dots,g_m)$, and the structural parameters $\phi_1,\dots,\phi_L$ in the latent densities.
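As a concrete illustration, the cluster-wise density in (2) with normal-regression latent components can be evaluated as follows. This is a minimal sketch; the function and argument names (`ghm_density`, `pi`, `beta`, `sigma`) are our own illustrative choices rather than notation from the paper.

```python
import numpy as np
from scipy.stats import norm

def ghm_density(y, X, g, pi, beta, sigma):
    """Cluster-wise conditional density f_i(y|x) under the GHM model (2).

    y: (n_i,) responses in cluster i; X: (n_i, p) covariates (incl. intercept)
    g: group label of cluster i (0-based); pi: (G, L) group-wise mixing proportions
    beta: (L, p) regression coefficients; sigma: (L,) std devs of latent normals
    """
    # densities of the latent normal-regression components h_k(y|x; phi_k)
    comp = np.stack([norm.pdf(y, loc=X @ beta[k], scale=sigma[k])
                     for k in range(beta.shape[0])], axis=1)   # (n_i, L)
    # mix with the proportions of the cluster's group: sum_k pi_{g k} h_k
    return comp @ pi[g]                                        # (n_i,)
```

With `L = 1` and a degenerate mixing proportion this reduces to a single normal regression density, which gives a quick sanity check of the implementation.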

### 2.2 Parameter estimation via generalized EM algorithm

Let $\Theta$ be the set of unknown parameters. We first consider estimating $\Theta$ under known numbers of groups $G$ and latent components $L$; the selection of $G$ and $L$ is described at the end of this section. Given the data, the log-likelihood function of $\Theta$ is obtained in the following closed form:

$$ Q(\Theta)=\sum_{i=1}^{m}\sum_{j=1}^{n_i}\log\left(\sum_{k=1}^{L}\pi_{g_i k}\,h_k(y_{ij}\mid x_{ij};\phi_k)\right). \tag{3} $$

The maximum likelihood estimator of $\Theta$ is defined as the maximizer of $Q(\Theta)$. As is often the case in mixture modeling, we develop an iterative method for maximizing $Q(\Theta)$ with respect to $\Theta$. Specifically, we use the generalized EM (GEM) algorithm (Meng and Rubin, 1993). To this end, we consider the following hierarchical expression of the model (2):

$$ y_{ij}\mid z_{ij}=k \sim h_k(y_{ij}\mid x_{ij};\phi_k),\qquad P(z_{ij}=k)=\pi_{g_i k}, \tag{4} $$

where the $z_{ij}$'s are latent variables representing the component memberships of the $y_{ij}$'s. Given the $z_{ij}$'s, the complete log-likelihood is given by

$$ L_c(\Theta)=\sum_{i=1}^{m}\sum_{j=1}^{n_i}\sum_{k=1}^{L}I(z_{ij}=k)\big\{\log h_k(y_{ij}\mid x_{ij};\phi_k)+\log\pi_{g_i k}\big\}. $$

In the E-step, we compute expectations of the latent variables $z_{ij}$. Under the hierarchical expression (4), the conditional (posterior) probability of $z_{ij}=k$ given the observations, evaluated at the current value $\Theta^{(r)}$, is given by

$$ P(z_{ij}=k\mid Y;\Theta^{(r)})\equiv w^{(r)}_{ijk}=\frac{\pi^{(r)}_{g_i k}\,h_k(y_{ij}\mid x_{ij};\phi^{(r)}_k)}{\sum_{\ell=1}^{L}\pi^{(r)}_{g_i \ell}\,h_\ell(y_{ij}\mid x_{ij};\phi^{(r)}_\ell)}, $$

so that the objective function to be maximized in the M-step is obtained as

$$ Q(\Theta\mid\Theta^{(r)})=\sum_{k=1}^{L}\sum_{i=1}^{m}\sum_{j=1}^{n_i}w^{(r)}_{ijk}\log h_k(y_{ij}\mid x_{ij};\phi_k)+\sum_{i=1}^{m}\sum_{k=1}^{L}\log\pi_{g_i k}\sum_{j=1}^{n_i}w^{(r)}_{ijk}\equiv Q_1(\phi\mid\Theta^{(r)})+Q_2(\gamma,\pi\mid\Theta^{(r)}). \tag{5} $$

It is observed that the maximization of $Q_1$ with respect to $\phi$ can be divided into $L$ separate maximization problems, so that $\phi_1,\dots,\phi_L$ can be updated separately. On the other hand, maximizing $Q_2$ includes discrete optimization of $\gamma$ over the space $\{1,\dots,G\}^m$, which would be slightly complicated. Hence, instead of simultaneously maximizing $Q_2$ with respect to $\gamma$ and $\pi$, we first maximize $Q_2$ with respect to $\pi$ and set $\pi^{(r+1)}$ to the maximizer, and then maximize with respect to $\gamma$ to get $\gamma^{(r+1)}$. This updating process guarantees a monotone increase of the objective function, that is, $Q(\Theta^{(r+1)})\geq Q(\Theta^{(r)})$. The proposed GEM algorithm is summarized in what follows.

GEM algorithm

• Set the initial values $\Theta^{(0)}$ and let $r=0$.

• (E-step)   Compute the following weights:

$$ w^{(r)}_{ijk}=\frac{\pi^{(r)}_{g_i k}\,h_k(y_{ij}\mid x_{ij};\phi^{(r)}_k)}{\sum_{\ell=1}^{L}\pi^{(r)}_{g_i \ell}\,h_\ell(y_{ij}\mid x_{ij};\phi^{(r)}_\ell)}. $$

• (M-step-1)   Solve the following maximization problem and set $\phi^{(r+1)}_k=\hat\phi_k$:

$$ \hat\phi_k=\operatorname*{argmax}_{\phi_k}\left\{\sum_{i=1}^{m}\sum_{j=1}^{n_i}w^{(r)}_{ijk}\log h_k(y_{ij}\mid x_{ij};\phi_k)\right\},\qquad k=1,\dots,L. $$

• (M-step-2)   Update $\pi$ and $\gamma$ as follows:

$$ \pi^{(r+1)}_{gk}=\Big(\sum_{i:g^{(r)}_i=g}n_i\Big)^{-1}\sum_{i:g^{(r)}_i=g}\sum_{j=1}^{n_i}w^{(r)}_{ijk},\qquad g=1,\dots,G,\quad k=1,\dots,L, $$
$$ g^{(r+1)}_i=\operatorname*{argmax}_{g=1,\dots,G}\left\{\sum_{k=1}^{L}\log\pi^{(r+1)}_{gk}\sum_{j=1}^{n_i}w^{(r)}_{ijk}\right\},\qquad i=1,\dots,m. $$

• If the algorithm has converged, terminate; otherwise, set $r\leftarrow r+1$ and go back to the E-step.

Note that updating $\phi_k$ requires maximizing the weighted log-likelihood of the $k$th latent distribution, which can be carried out easily and efficiently as long as the latent density is familiar, e.g. a generalized linear model. Moreover, for updating $g_i$, we simply compute the values of the objective function for $g=1,\dots,G$ and select the maximizer. Therefore, all the M-steps in the GEM algorithm are easy to execute, so the GEM algorithm is computationally quite easy to carry out.
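To make the updates concrete, the GEM iteration above can be sketched for normal-regression latent densities, where M-step-1 reduces to weighted least squares. This is a minimal illustrative implementation, not the authors' code; the function name `gem_ghm`, the initialization, and the fixed iteration count are our own choices.

```python
import numpy as np
from scipy.stats import norm

def gem_ghm(y, X, cluster, G, L, n_iter=100, seed=0):
    """Generalized EM for the GHM model with normal-regression latent densities.

    y: (N,) responses; X: (N, p) covariates (incl. intercept column)
    cluster: (N,) integer cluster labels in {0,...,m-1}
    Returns group labels g (m,), proportions pi (G, L), coefficients beta (L, p),
    and std devs sigma (L,).
    """
    rng = np.random.default_rng(seed)
    N, p = X.shape
    m = cluster.max() + 1
    # initialization: random group labels, perturbed OLS coefficients
    g = rng.integers(0, G, size=m)
    pi = np.full((G, L), 1.0 / L)
    beta = np.linalg.lstsq(X, y, rcond=None)[0] + rng.normal(0, 0.5, (L, p))
    sigma = np.full(L, y.std())
    for _ in range(n_iter):
        # E-step: posterior weights w_ijk
        comp = np.stack([norm.pdf(y, X @ beta[k], sigma[k]) for k in range(L)],
                        axis=1)                      # (N, L)
        num = comp * pi[g[cluster]]
        w = num / num.sum(axis=1, keepdims=True)
        # M-step 1: weighted least squares for each latent component
        for k in range(L):
            Wk = w[:, k]
            Xw = X * Wk[:, None]
            beta[k] = np.linalg.solve(X.T @ Xw, Xw.T @ y)
            resid = y - X @ beta[k]
            sigma[k] = np.sqrt((Wk * resid**2).sum() / Wk.sum())
        # M-step 2: group-wise proportions first, then group reassignment
        for grp in range(G):
            idx = np.isin(cluster, np.where(g == grp)[0])
            if idx.any():
                pi[grp] = w[idx].sum(axis=0) / idx.sum()
        sw = np.zeros((m, L))
        np.add.at(sw, cluster, w)                    # cluster-wise sums of weights
        g = np.argmax(sw @ np.log(pi).T, axis=1)
    return g, pi, beta, sigma
```

M-step-2 first recomputes the group-wise proportions under the current labels and then reassigns each cluster to the group maximizing its weighted log-proportion score, mirroring the two-stage update that guarantees monotone ascent.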

Finally, for selecting the values of $G$ (the number of groups) and $L$ (the number of latent components), we adopt the following BIC-type criterion:

$$ \mathrm{BIC}(G,L)=-2Q(\hat\Theta)+\log N\,\{qL+G(L-1)+m\}, $$

where $N=\sum_{i=1}^m n_i$ is the total number of samples and $q$ is the dimension of $\phi_k$. The suitable values of $G$ and $L$ are set to the minimizer of $\mathrm{BIC}(G,L)$.
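In practice the criterion is evaluated over a grid of candidate values; a minimal sketch, where `loglik` is a stand-in for a fitting routine (e.g. the GEM algorithm) returning the maximized log-likelihood $Q(\hat\Theta)$:

```python
import numpy as np

def select_G_L(loglik, N, q, m, G_max=5, L_max=5):
    """Pick (G, L) minimizing the BIC-type criterion
    BIC(G, L) = -2 Q(Theta_hat) + log(N) * (q L + G (L - 1) + m).

    loglik: function (G, L) -> maximized log-likelihood Q(Theta_hat)
    N: total sample size; q: dim of each phi_k; m: number of clusters
    """
    grid = [(G, L) for G in range(1, G_max + 1) for L in range(1, L_max + 1)]
    bic = {(G, L): -2 * loglik(G, L) + np.log(N) * (q * L + G * (L - 1) + m)
           for G, L in grid}
    return min(bic, key=bic.get)
```

The grid bounds `G_max` and `L_max` are illustrative; in the numerical studies below the candidate ranges are chosen per experiment.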

## 3 Asymptotic Properties

We investigate the large-sample properties of the maximum likelihood estimator when the cluster sizes grow with the number of clusters. Without loss of generality, we suppose that the cluster labels are assigned so that the cluster size increases with the cluster index, that is, $n_1\leq n_2\leq\cdots\leq n_m$, and we write $n=n_1$ for the minimum cluster size. Moreover, we assume that $m\to\infty$ as well as $n\to\infty$, as considered in the literature on estimation of random effects models (e.g. Vonesh et al., 2002; Hui et al., 2017a). However, we consider asymptotics under the setting $\sqrt{m}\,n^{-\delta_1}\to 0$ for some $\delta_1>0$, that is, the cluster sizes are allowed to grow much more slowly than $m$, which would be met in various applications. In what follows, we further assume that the number of groups $G$ and the number of latent conditional distributions $L$ are known. Under this setting, we define $\psi_0$ and $\gamma_0$ as the true values of $\psi$, the vector of group-wise mixing proportions and structural parameters, and $\gamma$, the set of group membership variables. We now present asymptotic properties of the maximum likelihood estimator, whose proof is given in the Appendix.

###### Theorem 1.

Under the Assumptions given in the Appendix, with $n\to\infty$ and $\sqrt{m}\,n^{-\delta_1}\to 0$ for some $\delta_1>0$, it holds that

$$ \frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{L}\big(\hat\pi_{\hat g_i k}-\pi_{0g_{0i}k}\big)^2=O_p\!\left(\frac{1}{m}\right), \tag{6} $$
$$ \sqrt{m}\,F_m(\hat\Theta)^{1/2}(\hat\psi-\psi_0)\to N\big(0,\,I_{\dim(\psi)}\big), \tag{7} $$

where $F_m(\Theta)$ is the Fisher information matrix defined in the Appendix.

Although the number of cluster-wise mixing proportions grows with the number of clusters $m$, (6) shows that they are consistently estimated, with a convergence rate depending on $m$. Moreover, (7) shows that the structural parameters in the latent distributions and the group-wise mixing proportions, whose dimensions are fixed regardless of $m$ and $n_i$, are $\sqrt{m}$-consistent.

## 4 Numerical Studies

### 4.1 Simulation study: continuous response

We investigate the finite-sample performance of the proposed method together with some existing methods. We consider $m$ clusters and assume that each cluster $i$ ($i=1,\dots,m$) has the following cluster-wise conditional distribution:

$$ f_i(y\mid x_1,x_2)=\pi_i\,\phi(y;\alpha_{i0}+\alpha_{i1}x_1+\alpha_{i2}x_2,\,1)+(1-\pi_i)\,\phi(y;\beta_{i0}+\beta_{i1}x_1+\beta_{i2}x_2,\,(1.5)^2), $$

where $\pi_i$ is a mixing proportion and $\phi(y;a,b)$ denotes a normal density function with mean $a$ and variance $b$. We generated $x_1$ from a uniform distribution and $x_2$ from a Bernoulli distribution. We consider the following two cases for the mixing proportion $\pi_i$:

$$ \text{(A1)}\quad \pi_i\sim\mathrm{Beta}(2,1),\qquad \text{(A2)}\quad \pi_i\sim\mathrm{TP}(0.2,0.5,0.8), $$

where $\mathrm{TP}(a_1,a_2,a_3)$ denotes the uniform three-point distribution on $\{a_1,a_2,a_3\}$. The mixing proportion varies across all the clusters in case (A1), while in case (A2) the $\pi_i$'s are divided into three groups and clusters within the same group have the same mixing proportion. Moreover, we consider the following two cases for the structural parameters:

$$ \text{(B1)}\quad \alpha_{i0}=\alpha_{i2}=-1,\quad \alpha_{i1}=2,\quad \beta_{i0}=\beta_{i2}=1,\quad \beta_{i1}=-2, $$
$$ \text{(B2)}\quad \alpha_{i0},\alpha_{i2}\sim N(-1,(0.2)^2),\quad \beta_{i0},\beta_{i2}\sim N(1,(0.2)^2),\quad \alpha_{i1}\sim N(2,(0.2)^2),\quad \beta_{i1}\sim N(-2,(0.2)^2). $$

Note that the structural parameters in the mixing distributions are the same over all the clusters in case (B1), while the mixing distributions differ across clusters in case (B2). We consider the following four scenarios produced by all the combinations of the cases of $\pi_i$ and the structural parameters:

| scenario | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| case ($\pi_i$) | (A1) | (A2) | (A1) | (A2) |
| case ($\Theta_i$) | (B1) | (B1) | (B2) | (B2) |

From the conditional distribution, we generated $n_i$ samples in each cluster, considering two settings of the cluster size. To the simulated datasets, we applied the proposed grouped heterogeneous mixture (GHM) model as well as the global mixture (GM), local mixture (LM) and random effects (RE) models. In the GHM model, normal regression models are used for the latent conditional distributions, and the numbers of groups and latent components are selected by BIC among combinations of candidate values of $G$ and $L$. In the GM and LM models, we consider mixtures of normal regressions, and the number of components is selected by BIC. For the RE model, only a random intercept term is included. For fitting GM, LM and RE, we use the maximum likelihood estimators of the model parameters. Based on these methods, we construct estimators $\hat f_i(y\mid x_1,x_2)$ of the cluster-wise conditional densities and compute the following mean integrated squared error (MISE):

$$ \mathrm{MISE}(x_2)=\frac{1}{m}\sum_{i=1}^{m}\int_{-0.2}^{0.8}\int_{-\infty}^{\infty}\big\{\hat f_i(y\mid x_1,x_2)-f_i(y\mid x_1,x_2)\big\}^2\,dy\,dx_1, $$

which is approximated by a finite sum over equally spaced grid points for $x_1$ and $y$. We repeated simulating a dataset and computing the MISE 500 times. Since the results for the two values of $x_2$ are quite similar, we only report the boxplots for one of them in Figure 1, which shows that GHM clearly outperforms the other methods in all the scenarios. It should be noted that LM does not work well in almost all the scenarios, in spite of its flexibility. This is because fitting a mixture model based on moderate cluster sizes would be unstable.
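For reference, data from scenario 2 above (mixing case (A2) with fixed coefficients (B1)) can be generated as follows. The covariate distributions and the values of `m`, `n` and the Bernoulli success probability are illustrative assumptions, since the exact settings are not reproduced here.

```python
import numpy as np

def simulate_scenario2(m=100, n=30, seed=0):
    """Generate clustered data: pi_i ~ TP(0.2, 0.5, 0.8), coefficients (B1)."""
    rng = np.random.default_rng(seed)
    pi = rng.choice([0.2, 0.5, 0.8], size=m)   # (A2): three latent groups
    x1 = rng.uniform(0, 1, (m, n))             # assumed covariate range
    x2 = rng.binomial(1, 0.5, (m, n))          # assumed success probability
    z = rng.random((m, n)) < pi[:, None]       # component memberships
    # (B1): means -1 + 2 x1 - x2 and 1 - 2 x1 + x2; std devs 1 and 1.5
    mean = np.where(z, -1 + 2 * x1 - x2, 1 - 2 * x1 + x2)
    sd = np.where(z, 1.0, 1.5)
    y = mean + rng.normal(0.0, 1.0, (m, n)) * sd
    return y, x1, x2
```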

### 4.2 Simulation study: binary response

We next investigate the performance when the response is binary. To this end, we consider the following cluster-wise conditional distributions:

$$ f_i(y\mid x_1,x_2)=\pi_i\,\mathrm{Ber}\big(y;\mu_{1i}(x_1,x_2)\big)+(1-\pi_i)\,\mathrm{Ber}\big(y;\mu_{2i}(x_1,x_2)\big), $$

where $\mathrm{Ber}(y;\mu)$ denotes a Bernoulli probability function with success probability $\mu$, and $\mu_{1i}$ and $\mu_{2i}$ are cluster-wise success probabilities depending on the covariates. The two covariates $x_1$ and $x_2$ are generated in the same way as in the previous section, and we considered the same four scenarios for the mixing proportions and distributions. Based on the above cluster-wise conditional distribution, we generated samples in each cluster under two settings of the cluster size. To the simulated datasets, we applied the same four methods as in Section 4.1. We used binomial regressions as the latent or mixing distributions in GHM, GM and LM, and a logistic mixed regression model as RE. Similarly to Section 4.1, we computed the MISE, with the integral over $y$ replaced by a sum over $y\in\{0,1\}$ and the integral with respect to $x_1$ approximated by a finite sum over equally spaced grid points. In Figure 2, we show boxplots of the MISE based on 500 replications in the four scenarios. We observe similar results: GHM outperforms the other methods in all the scenarios.
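For binary responses, the M-step-1 update for each latent component is a weighted logistic regression, maximizing $\sum_{i,j} w^{(r)}_{ijk}\log h_k(y_{ij}\mid x_{ij};\phi_k)$. A sketch via Newton's method (IRLS); the function name and fixed iteration count are our own choices.

```python
import numpy as np

def weighted_logistic_mstep(y, X, w, n_iter=25):
    """Weighted logistic-regression update for one latent component phi_k.

    y: (N,) binary responses; X: (N, p) covariates; w: (N,) posterior weights.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))          # success probabilities
        grad = X.T @ (w * (y - mu))                   # weighted score
        H = (X * (w * mu * (1 - mu))[:, None]).T @ X  # weighted information
        beta += np.linalg.solve(H, grad)              # Newton step
    return beta
```

With all weights equal to one this reduces to an ordinary logistic regression fit, which provides a simple correctness check.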

### 4.3 Butterfly data

We consider an application of the proposed method to butterfly data (Oliver et al., 2006). The dataset consists of the abundance of butterflies at sites in Boulder County Open Space in the years 1999 and 2000. For each site, geographical information (longitude and latitude), habitat characteristics (grassland type and quality) and landscape context (percentage of surrounding urbanization) are available, and we use this information as covariates. Covariates taking continuous values were scaled to have mean zero and variance one. Species present at fewer than three sites were omitted. We let $y_{is}$ be a binary outcome representing presence ($y_{is}=1$) or absence ($y_{is}=0$) of species $i$ in site $s$.

For the dataset, we applied the following GHM model:

$$ y_{is}\sim\sum_{k=1}^{L}\pi_{g_i k}\,\mathrm{Ber}\big(\mathrm{logit}^{-1}(\alpha_k+x_s^{t}\beta_k)\big), $$

where $\mathrm{Ber}(\mu)$ denotes a Bernoulli distribution with success probability $\mu$ and $x_s$ is the vector of covariates in site $s$. The numbers of groups and latent conditional distributions were selected by the BIC-type criterion, and $L=3$ latent components were selected. This means that the species can be divided into groups, and each group-wise conditional distribution is expressed as a mixture of three logistic regression models. In Table 1, the estimates of the regression coefficients in the three latent logistic regression models are reported, which shows that the estimates differ markedly across the models. Moreover, in Table 2, the estimates of the mixing proportions in the groups are shown. Although the estimated mixing proportions are somewhat similar in groups 2 and 5, they differ substantially across the groups, so the differences among groups are well represented by the mixing proportions.

For comparison of predictive performance, a subset of sites was randomly omitted to act as test data. Then we applied the four methods (GHM, GM, LM and RE) to the remaining data (training data) to construct a predictive model for the test data. Similarly to Section 4.2, we considered mixtures of logistic regression models for GM and LM, and a logistic random intercept model as RE. The numbers of mixture components in LM and GM were selected by BIC. To assess the predictive performance, we computed the likelihood of the test data (which we call the test likelihood) under each estimated model. This procedure was repeated, so that a test likelihood was computed for each replication and method. We found that the test likelihoods of LM were very small compared with the other methods, possibly because the cluster sizes are not so large and there are 7 covariates. Moreover, the proportions of presence are not small for some species, which may lead to unstable results when fitting a mixture model to each species. For the other methods, we show the test likelihoods for each replication in Figure 3. We observe that the test likelihood of GHM is substantially larger than those of the other methods in almost all the replications, which demonstrates the better prediction performance of GHM.

## 5 Final remarks

Estimation of cluster-wise conditional distributions is a powerful tool for revealing the distributional relationship between response and covariates while accounting for heterogeneity among clusters. For this purpose, we proposed a novel mixture modeling approach, called grouped heterogeneous mixture (GHM) modeling, for estimating cluster-wise conditional distributions. The estimation of the GHM model is easily carried out using the generalized EM algorithm. Under the setting where the cluster size grows with, but considerably more slowly than, the number of clusters, we established asymptotic properties of the maximum likelihood estimator.

In this work, we have not addressed problems caused by the dimension of the structural parameters in the latent distributions. Such problems are typically related to the number of covariates, and the proposed method might break down when the number of covariates is very large. In this case, variable selection or regularization is needed. In the framework of mixture modeling, there are several works on variable selection via regularization (penalization) methods (e.g. Hui et al., 2015; Khalili and Chen, 2007). The adaptation of these methods to our framework would be a valuable future study.

Acknowledgements     This work was supported by JSPS KAKENHI Grant Number 16H07406.

## Appendix: Proof of Theorem 1

We require the following regularity conditions:

• (A1) The density (probability) function is identifiable in $\Theta$ up to permutation of the component and grouping labels.

• (A2) The Fisher information matrix

$$ F_m(\Theta)=E\left[\left(\frac{\partial}{\partial\psi}\log g(Y\mid X;\gamma,\psi)\right)\left(\frac{\partial}{\partial\psi}\log g(Y\mid X;\gamma,\psi)\right)^{t}\right] $$

is finite and positive definite at $\psi=\psi_0$ and $\gamma=\gamma_0$.

• (A3) There exists an open subset of the parameter space containing the true parameters on which the ratio

$$ \frac{f_i(y\mid x;\gamma,\psi)}{f_i(y\mid x;\gamma^{\dagger},\psi)} $$

is bounded by functions with suitable finite moments, for arbitrary membership vectors $\gamma$ and $\gamma^{\dagger}$.

• (A4) The true parameter is an interior point of the compact parameter space.

We define a neighborhood of $\psi_0$ as $N_\eta=\{\psi:\|\psi-\psi_0\|\leq\eta\}$, where $\eta>0$ is a scalar and $\|\cdot\|$ denotes the $L_2$-norm. We first show the following lemma.

###### Lemma 1.

For any $\delta_1>0$ and any $\psi\in N_\eta$, it holds that

$$ \frac{1}{m}\sum_{i=1}^{m}I\big(\hat g_i(\psi)\neq g_{0i}\big)=o_p(n^{-\delta_1}). $$
###### Proof.

From Markov’s inequality, it follows that

$$ P\left(\frac{1}{m}\sum_{i=1}^{m}I\big(\hat g_i(\psi)\neq g_{0i}\big)>\varepsilon n^{-\delta_1}\right)\leq\frac{n^{\delta_1}}{m\varepsilon}\sum_{i=1}^{m}P\big(\hat g_i(\psi)\neq g_{0i}\big) \tag{8} $$

for arbitrary $\varepsilon>0$ and $\delta_1>0$. Note that

$$ P\big(\hat g_i(\psi)\neq g_{0i}\big)\leq\sum_{g\neq g_{0i}}P\left(\sum_{j=1}^{n_i}\log f_i(y_{ij};g,\psi)\geq\sum_{j=1}^{n_i}\log f_i(y_{ij};g_{0i},\psi)\right)=\sum_{g\neq g_{0i}}P\left(\sum_{j=1}^{n_i}\log\frac{f_i(y_{ij};g,\psi)}{f_i(y_{ij};g_{0i},\psi)}\geq 0\right). $$

Moreover, from Markov’s inequality, it holds that

$$ P\left(\sum_{j=1}^{n_i}\log\frac{f_i(y_{ij};g,\psi)}{f_i(y_{ij};g_{0i},\psi)}\geq 0\right)\leq\left\{E\left[\exp\left(\frac{1}{2}\log\frac{f_i(y_{ij};g,\psi)}{f_i(y_{ij};g_{0i},\psi)}\right)\right]\right\}^{n_i}=\exp\left\{n_i\log E\left[\sqrt{\frac{f_i(y_{ij};g,\psi)}{f_i(y_{ij};g_{0i},\psi)}}\right]\right\}. $$

Note that

$$ \sqrt{\frac{1}{f_i(y_{ij};g_{0i},\psi)}}=\sqrt{\frac{1}{f_i(y_{ij};g_{0i},\psi_0)}}+f_i(y_{ij};g_{0i},\psi^{\dagger})^{-3/2}(\psi-\psi_0)^{t}\,\frac{\partial}{\partial\psi}f_i(y_{ij};g_{0i},\psi)\Big|_{\psi=\psi^{\dagger}} $$
$$ =\sqrt{\frac{1}{f_i(y_{ij};g_{0i},\psi_0)}}+f_i(y_{ij};g_{0i},\psi^{\dagger})^{-1/2}(\psi-\psi_0)^{t}\,\frac{\partial}{\partial\psi}\log f_i(y_{ij};g_{0i},\psi)\Big|_{\psi=\psi^{\dagger}}, $$

where $\psi^{\dagger}$ lies on the line segment joining $\psi$ and $\psi_0$. Therefore, under Assumption A3, we have

$$ E\left[\sqrt{\frac{f_i(y_{ij};g,\psi)}{f_i(y_{ij};g_{0i},\psi)}}\right]=\int\sqrt{\frac{f_i(y_{ij};g,\psi)}{f_i(y_{ij};g_{0i},\psi)}}\,f_i(y_{ij};g_{0i},\psi_0)\,dy_{ij} $$
$$ =\int\sqrt{f_i(y_{ij};g,\psi)\,f_i(y_{ij};g_{0i},\psi_0)}\,dy_{ij}+(\psi-\psi_0)^{t}E\left[\left(\frac{\partial}{\partial\psi}\log f_i(y_{ij};g_{0i},\psi)\Big|_{\psi=\psi^{\dagger}}\right)\sqrt{\frac{f_i(y_{ij};g,\psi)}{f_i(y_{ij};g_{0i},\psi^{\dagger})}}\right] $$
$$ \leq 1-H\big(f(\cdot;g,\psi),f(\cdot;g_{0i},\psi_0)\big)+C\eta, $$

where $H(f_1,f_2)$ denotes the Hellinger distance between $f_1$ and $f_2$, and $C$ is a constant. Under Assumption A1, $H\big(f(\cdot;g,\psi),f(\cdot;g_{0i},\psi_0)\big)>0$ when $g\neq g_{0i}$, so that for sufficiently small $\eta$ there exists a constant $c>0$ such that

$$ \sup_{\psi\in N_\eta}\log\big\{1-H\big(f(\cdot;g,\psi),f(\cdot;g_{0i},\psi_0)\big)+C\eta\big\}<-c. $$

Hence, we have

$$ P\left(\sum_{j=1}^{n_i}\log\frac{f_i(y_{ij};g,\psi)}{f_i(y_{ij};g_{0i},\psi)}\geq 0\right)\leq\exp(-cn_i)=o(n^{-\delta_1}), $$

so the right side of (8) is $o(1)$, which completes the proof. ∎

We next show the consistency of $\hat\psi$, as given in the following lemma:

###### Lemma 2.

As $m\to\infty$ and $n\to\infty$, it holds that $\hat\psi\to_p\psi_0$.

###### Proof.

Define $\gamma^{\dagger}_s=(g^{\dagger}_{1s},\dots,g^{\dagger}_{ms})$ such that $m^{-1}\sum_{i=1}^{m}I(g^{\dagger}_{is}\neq g_{0i})\leq s$, and let $u$ be a vector such that $\|u\|=1$. Further, we define

$$ R(\psi,\gamma)\equiv\frac{1}{N}\sum_{i=1}^{m}R_i(\psi,g_i)=\frac{1}{N}Q(\Theta). $$

Note that

$$ \big|R(\psi_0+\eta u,\gamma^{\dagger}_s)-R(\psi_0,\gamma_0)\big|\leq\big|R(\psi_0+\eta u,\gamma_0)-R(\psi_0,\gamma_0)\big|+\big|R(\psi_0+\eta u,\gamma^{\dagger}_s)-R(\psi_0+\eta u,\gamma_0)\big|\equiv I_1+I_2. $$

Under Assumptions A2 and A3, using a similar argument to that in the proof of Theorem 1 of Hui et al. (2015), $I_1$ can be controlled. Regarding $I_2$, it is noted that

$$ I_2\leq\frac{1}{N}\sum_{i=1}^{m}\big|R_i(\psi_0+\eta u,g^{\dagger}_{is})-R_i(\psi_0+\eta u,g_{0i})\big|=\frac{1}{N}\sum_{i=1}^{m}I(g^{\dagger}_{is}\neq g_{0i})\,\big|R_i(\psi_0+\eta u,g^{\dagger}_{is})-R_i(\psi_0+\eta u,g_{0i})\big|\leq sC, $$

for some constant $C>0$. Therefore, for any $\eta>0$, there exists a local maximum inside the set $N_\eta$, so that the consistency follows. ∎

###### Proof of Theorem 1.

We define the following objective functions:

$$ \hat Q(\psi)=\sum_{i=1}^{m}\sum_{j=1}^{n_i}\log\left(\sum_{k=1}^{L}\pi_{\hat g_i(\psi)k}\,h_k(y_{ij}\mid x_{ij};\phi_k)\right), $$
$$ \tilde Q(\psi)=\sum_{i=1}^{m}\sum_{j=1}^{n_i}\log\left(\sum_{k=1}^{L}\pi_{g_{0i}k}\,h_k(y_{ij}\mid x_{ij};\phi_k)\right), $$

and we define $\hat\psi$ and $\tilde\psi$ as the maximizers of $\hat Q(\psi)$ and $\tilde Q(\psi)$, respectively; note that $\hat\psi$ is the maximum likelihood estimator of $\psi$. From Lemma 1, it holds that

$$ \sup_{\psi\in N_\eta}\big|\hat Q(\psi)-\tilde Q(\psi)\big|=o_p(n^{-\delta_1}). \tag{9} $$

Then, from the definitions of $\hat\psi$ and $\tilde\psi$, we have

$$ 0\leq\tilde Q(\tilde\psi)-\tilde Q(\hat\psi)=\big\{\tilde Q(\tilde\psi)-\hat Q(\tilde\psi)\big\}+\big\{\hat Q(\tilde\psi)-\tilde Q(\hat\psi)\big\}\leq\big\{\tilde Q(\tilde\psi)-\hat Q(\tilde\psi)\big\}+\big\{\hat Q(\hat\psi)-\tilde Q(\hat\psi)\big\}=o_p(n^{-\delta_1}), $$

where the last equality follows from (9) and from the fact that $\hat\psi,\tilde\psi\in N_\eta$ for large $m$, since $\hat\psi$ and $\tilde\psi$ are consistent. Combined with (9), this yields $\hat\psi-\tilde\psi=o_p(n^{-\delta_1})$. Hence, it follows that

$$ \sqrt{m}(\hat\psi-\psi_0)=\sqrt{m}(\tilde\psi-\psi_0)+\sqrt{m}(\hat\psi-\tilde\psi)=\sqrt{m}(\tilde\psi-\psi_0)+o_p\big(\sqrt{m}\,n^{-\delta_1}\big)=\sqrt{m}(\tilde\psi-\psi_0)+o_p(1) $$

under $\sqrt{m}\,n^{-\delta_1}\to 0$ for some $\delta_1>0$, which establishes (7). Moreover, from Lemma 1 and the asymptotic properties of $\tilde\psi$, it holds that

$$ \frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{L}\big(\hat\pi_{\hat g_i k}-\hat\pi_{g_{0i}k} $$